The Life of Kenneth

Building the Micro Mirror Free Software CDN

9 May 2023 at 17:47

As should surprise no one, based on my past projects of running my own autonomous system, building my own Internet Exchange Point, and building a global anycast DNS service, I kind of enjoy building Internet infrastructure to make other people's experience online better. So that happened again, and like usual, this project got well out of hand.

Linux updates

You run apt update or dnf upgrade and your Linux system goes off, downloads updates from your distro, and installs them. Most people think nothing of it, but serving all of those files to every Linux install in the world is a challenging problem, and it's made even harder because most Linux distributions are free and thus don't have the budget to spin up a global CDN (Content Distribution Network) with dozens or hundreds of servers dedicated to putting bits on the wire for clients over the Internet.

How Linux distros get around this budget issue is that they host a single "golden" copy of all of their project files (Install media ISOs, packages, repository index files, etc) and volunteers around the world who are blessed with surplus bandwidth download a copy of the whole project directory, make it available from their own web server that they build and maintain themselves, and then register their mirror of the content back with the project. Each free software project then has a load balancer that directs clients to nearby mirrors of the content they're requesting while making sure that the volunteer mirrors are still online and up to date.

At the beginning of 2022, one of my friends (John Hawley) and I were discussing the fact that the network that used to operate a Linux mirror in the same datacenter as us had moved out of the building, and maybe it would be fun to build our own mirror to replace it.

John: "Yeah... it would probably be fun to get back into mirroring since I used to run the mirrors.kernel.org mirrors" (world's largest and most prominent Linux mirror)
Me: "Wait... WHAT?!"


So long story short, the two of us pooled our money together, and went and dropped $4500 on a SuperMicro chassis, stuffed it full of RAM (384GB) and hard drives (6x 16TB) and racked it below the Google Global Cache I'm hosting in my rack in Fremont.

Like usual, I was posting about this as it was happening on Twitter (RIP), and several people expressed interest in contributing to the project, so I posted a PayPal link, and we came up with the offer that if you donated $320 to the project, you'd get your name on one of the hard drives inside the chassis in Fremont, since that's how much we were paying for each of the 16TB drives.


This "hard drive sponsor" tier also spawned what I think was one of the most hilarious conversations of this whole project, where one of my friends was trying to grasp why people were donating money to get their name on a piece of label tape, stuck to a hard drive, inside a server, inside a locked rack, inside of a data center, where there most certainly was no possibility of anyone ever actually seeing their name on the hard drive. A rather avant-garde concept, I will admit.

The wild part was that we "sold out" of the "Hard Drive Sponsor" donation tier, and enough people contributed to the project that we covered almost all of the hardware cost of the original mirror.fcix.net server!

So long story short, we decided to spin up a Linux mirror, fifty of my friends on Twitter chipped in on the project, and we were off to the races trying to load 50TB of Linux distro and free software artifacts on the server to get it up and running. All well and good, and a very typical Linux mirror story.


Where things started to get out of hand is when John started building a Grafana dashboard to parse all of the Nginx logs coming out of our shiny new Linux mirror and analyze how much of each project we were actually serving. Pivoting the data by metrics like project, release, and file type, we came to the realization that while we were hosting 50TB worth of files for various projects, more than two thirds of our network traffic was coming from a very limited number of projects and only about 3TB of files on disk! And this is where the idea of the Micro Mirror began to take shape.
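
The real dashboard lives in Grafana, but you can get a rough version of the same answer straight from the access log with standard tools. A sketch, assuming Nginx's default "combined" log format and a directory-per-project layout (the log path and field positions here are assumptions, not our exact setup):

$ awk '{ split($7, path, "/"); bytes[path[2]] += $10 }
       END { for (p in bytes) printf "%15d  %s\n", bytes[p], p }' \
      /var/log/nginx/access.log | sort -rn | head

This totals the bytes sent per top-level directory (i.e. per project), which is exactly the "which 3TB of files are doing two thirds of the traffic" question we were asking.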

The Micro Mirror Thesis

If the majority of the network traffic on a Linux mirror comes from a small slice of the assets hosted on it, then it should be possible to build a very small and focused mirror that only hosts that "hot working set" subset of projects. While less effective than our full-sized mirror, it could plausibly be half as effective at 10% of the cost.


So we set ourselves the challenge of trying to design a tiny Linux mirror which could pump out a few TB of traffic a day (as opposed to the 12-15TB/day of traffic served from mirror.fcix.net) with a hardware cost less than the $320 that we spent on one of the hard drives in the main storage array. Thanks to eBay and my love for last-gen enterprise thin clients, we settled on a design built around a used HP T620 thin client with a 2TB SSD.

This could all be had for less than $250 used on eBay, and it fits nicely in a medium flat rate USPS box, so once we build it and find a random network in the US willing to plug this thing in for us, we can just drop it in the mail.

We built the prototype, one of my other friends in Fremont offered to host it for us (since we're only using the 1GbaseT NIC on-board the thin client), and we were off to the races. Hosting only Ubuntu ISOs, Extra Packages for Enterprise Linux, and the CentOS repo for servers, the tiny mirror easily exceeded our design objective of >1TB/day of network traffic. It's not a replacement for traditional "heavy iron" mirrors that can host a longer tail of projects, but this is 1TB of traffic which we were able to peel off of those bigger mirrors so they can spend their resources serving the less popular content, which we wouldn't be able to fit on the single 2TB SSD inside this box.


Now it just became a question of "well, if one Micro Mirror was pretty successful, exactly how many MORE of these little guys could we stamp out and find homes for???"

These Micro Mirrors have several very attractive features to them for the hosting network:
  • They are fully managed by us. Many networks and service providers want to contribute back to the free software community but don't have the spare engineering resources to build and manage their own mirror server, so a fully managed appliance lets them contribute their network bandwidth at no manpower cost.
  • They're very small and can fit just about anywhere inside a hosting network's rack.
  • They're low power (15W).
  • They're fault tolerant, since each project's load balancer performs health checks on the mirrors, and if this mirror or the hosting network has an outage, the load balancers simply stop sending clients to our mirror until we get around to fixing the issue.
Then it was just a question of scaling the idea up: a Kickstart file so I can take the raw hardware and perform completely hands-off provisioning of the server, and an Ansible playbook that takes a short config file per node and fully provisions the HTML header, project update scripts, and rsync config on each server. Suddenly I can stamp out a new Micro Mirror with less than 30 minutes worth of total work.
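
As a flavor of what that playbook templates out, here's a rough sketch of a per-project sync script a Micro Mirror might run from cron; the upstream host, module name, and paths are placeholders rather than our real config, but the flock-plus-rsync shape is the general pattern mirrors use:

#!/bin/bash
# Sketch of a per-project mirror update script (placeholder upstream/paths).
set -euo pipefail

UPSTREAM="rsync://rsync.example.org/epel/"
DEST="/srv/mirror/epel/"
LOCK="/var/lock/mirror-epel.lock"

# flock keeps a slow sync from stacking up on top of the next cron run
exec flock -n "$LOCK" \
    rsync --archive --delete-delay --delay-updates --safe-links \
          --timeout=600 "$UPSTREAM" "$DEST"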


Finding networks willing to host nodes turned out to be extremely easy. Between Twitter, Mastodon, and a few select Slack channels I hang out on, I was able to easily build a waiting list of hosts that surpassed the inventory of thin clients I had laying around. Then we just needed to figure out how to fund more hardware beyond what we were personally willing to buy. Enter LiberaPay, an open source service similar to Patreon where people can pledge donations to us to keep funding this long term.

So now we have a continual (albeit very small) funding source, and a list of networks waiting for hardware, and it's mainly been a matter of waiting for enough donations to come in to fund another node, ordering the parts, provisioning the server, dropping it in the mail, and waiting for the hosting network to plug it in so we can run our Ansible playbook against it and get it registered with all the relevant projects.


So now we had a solid pipeline set up, and we could start playing around with hardware designs beyond the HP T620 thin client. The RTL8168 NIC on the T620s is far from ideal for pumping out a lot of traffic, and we got feedback from several hosting networks that they just don't have the ability to plug in baseT servers anymore, and would much prefer a 10Gbps SFP+ handoff to the appliance.


The desire for 10G handoffs has been a bit of a challenge while still trying to stay within the $320 hardware budget we set for ourselves, but we have been doing some experiments with the HP T620 Plus thin client, which happens to have a PCIe slot that fits a Mellanox ConnectX-3 NIC, and we also received a very generous donation of a pile of Dell R220 servers with 10G NICs from Arista Networks (thanks, Arista!).


So now the project has very easily gotten out of hand. We have more than 25 Micro Mirror nodes of various permutations live in production, most in North America but with several deployed internationally. Daily we serve roughly 60-90TB of Linux and other free software updates from these Micro Mirrors, with more than 150Gbps of port capacity. No single node makes a profound difference on its own, but each Micro Mirror we deploy has made a small incremental improvement in how fast users are able to download updates and new software releases.

So if you've started noticing a bunch of *.mm.fcix.net mirrors for your favorite project, this is why. We hit a sweet spot with this managed appliance and have been stamping them out as resources permit.

Interested in Helping?


The two major ways someone can help us with this project are funding new hardware and providing locations to host the Micro Mirrors:
  • Cash contributions are best sent via my LiberaPay account.
  • Any service providers interested in hosting nodes in their data center network can reach out to mirror@fcix.net to contact us and get on our wait list.
We are not interested in deploying these nodes on residential ISP connections; even if you have a 1Gbps Internet connection at home, we want to limit deployment of these servers to wholesale transit contexts in data centers where we can work directly with the ISP's NOC.

Of course, nothing is preventing anyone from going out and setting up their own Linux mirror. For the sake of diversity, having more mirror admins out there running their own mirrors is ultimately better than growing this Micro Mirror project. If you're looking to spin up your own mirror and have any specific questions on the process, feel free to reach out to us for that as well.

I also regularly post about this project on Mastodon, if you want to follow along real time.

Unlocking Third Party Transceivers on Older Arista Switches

5 May 2021 at 17:00

Like many switch vendors, Arista switches running EOS only accept Arista-branded optics by default. They accept any brand of passive direct attach copper cable, but since third-party optics tend to cause a prohibitive number of support cases, the user is forced to make a deliberate decision to enable third-party optics, so they can appreciate that they're venturing out into unsupported territory.

The thing is, if you're out buying an older generation of Arista switch on eBay to try and learn EOS, you're probably also not able to directly buy Arista-branded optics, so the unlock procedure on older switches would be of interest to you.

EOS has two different methods for unlocking third-party transceivers, depending on what hardware you're trying to unlock. It's not dependent on what version of EOS you're running: switches that originally supported the old "magic file" unlock method have always supported it, and newer switches have always required the customer unlock code.

  1. The "magic file" method is used on the earlier switches, and depends on EOS checking for a file named "enable3px" in the flash: partition. There doesn't need to be anything in this file; it just needs to exist, so an easy way to create this file from the EOS command line is "bash touch /mnt/flash/enable3px"
  2. The newer "customer unlock code" method instead relies on each customer having a conversation with their account team to get a cryptographic key for their specific account, which is then visible in their running-config as a "service unsupported-transceiver CUSTOMERNAME LICENSEKEY" line
Once you apply one of these two unlock methods, save your running config and reload the switch to have it unlock all your transceivers and accept all future optics you install.Β 
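
For example, applying the magic-file method from the EOS CLI looks something like this (the hostname is just a placeholder):

switch#bash touch /mnt/flash/enable3px
switch#write memory
switch#reload

On a newer switch you'd instead paste in the line your account team gives you while in config mode, then save and reload the same way:

switch(config)#service unsupported-transceiver CUSTOMERNAME LICENSEKEY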

So if you're looking at a switch on the second hand market and trying to figure out if you can unlock it using the old magic file method, here is a list of the switches which predate the newer customer license key method and work with the empty "enable3px" file:
  • DCS-7120T-4S
    • 20x 10GbaseT
    • 4x SFP+
    • Last EOS version 4.13.16M
  • DCS-7140T-8S
    • 40x 10GbaseT
    • 8x SFP+
    • Last EOS version 4.13.16M
  • DCS-7048T-4S
    • 48x 1GbaseT
    • 4x SFP+
    • Last EOS version 4.13.16M
  • DCS-7048T-A
    • 48x 1GbaseT
    • 4x SFP+
    • Last EOS version 4.15.10M
  • DCS-7124S
    • 24x SFP+
    • Last EOS version 4.13.16M
  • DCS-7124SX
    • 24x SFP+
    • Last EOS version 4.13.16M
  • DCS-7124FX
    • 24x SFP+
    • Last EOS version 4.13.16M
  • DCS-7148S
    • 48x SFP+
    • Last EOS version 4.13.16M
  • DCS-7148SX
    • 48x SFP+
    • Last EOS version 4.13.16M
  • DCS-7050T-36
    • 32x 10GbaseT
    • 4x SFP+
    • Last EOS version 4.18.11M-2G
  • DCS-7050T-52
    • 48x 10GbaseT
    • 4x SFP+
    • Last EOS version 4.18.11M-2G
  • DCS-7050T-64
    • 48x 10GbaseT
    • 4x QSFP+
    • Last EOS version 4.18.11M-2G
  • DCS-7050S-52
    • 52x SFP+
    • Last EOS version 4.18.11M-2G
  • DCS-7050S-64
    • 48x SFP+
    • 4x QSFP+
    • Last EOS version 4.18.11M-2G
  • DCS-7050Q-16
    • 8x SFP+
    • 16x QSFP+
    • Last EOS version 4.18.11M-2G
  • DCS-7150S-24
    • 24x SFP+
    • Last EOS version 4.23.x-2G (still an active train)
  • DCS-7150S-52
    • 52x SFP+
    • Last EOS version 4.23.x-2G (still an active train)
  • DCS-7150S-64
    • 48x SFP+
    • 4x QSFP+
    • Last EOS version 4.23.x-2G (still an active train)
  • 7548S-LC line cards on an original 7504/7508
    • 48x SFP+ line cards
    • Last EOS version 4.15.10M
    • None of the E series or R series line cards support the magic file method
So hopefully you find that list helpful.

If you're looking to unlock transceivers on newer switches, I'm afraid I'm not able to help there. You may try filling out the "contact sales" form on the Arista website and have a conversation with your account rep about what you're trying to do. Don't contact TAC or support; they're never authorized to hand out unlock keys and will only refer you to your account team.

Building an Anycast Secondary DNS Service

19 November 2020 at 05:00


Regular readers will remember how one of my friends offered to hold my beer while I went and started my own autonomous system, and then the two of us and some of our friends thought it would be funny to start an Internet Exchange, and that joke got wildly out of hand... Well, my lovely friend Javier has done it again.

About 18 months ago, Javier and I were hanging out having dinner, and he idly observed in my direction that he had built a DNS service back in 2009 that was still mostly running, but it had fallen into some disrepair and needed a bit of a refresh. Javier is really good at giving me fun little projects to work on... that turn out to be a lot of work. But I think that's what hobbies are supposed to look like? Maybe? So here we are.

With Javier's help, I went and built what I'm calling "Version Two" of Javier's NS-Global DNS Service.

Understanding Secondary DNS

Before we dig into NS-Global, let's talk about DNS for a few minutes. Most people's interaction with DNS is that they configure a DNS resolver like 1.1.1.1 or 9.9.9.9 on their network, and everything starts working better, because DNS is really important and a keystone to the rest of the Internet working.

The thing is, DNS resolvers are just the last piece in the chain of DNS. From the user's perspective, you fire off a question to your DNS resolver, like "what is the IP address for twitter.com?" and the resolver spends some time thinking about it, and eventually comes back with an answer for what IP address your computer should talk to for you to continue your doom scrolling.

And these resolving DNS services aren't what we built. Those are really kind of a pain to manage, and Cloudflare will always do a better job than us at marketing since they were able to get a pretty memorable IP address for it. But the resolving DNS services need somewhere to go to get answers for their user's questions, and that's where authoritative DNS servers come in.

When you ask a recursive DNS resolver for a resource record (like "A/IPv4 address for www.twitter.com") they very well may know nothing about Twitter, and need to start asking around for answers. The first place the resolver will go is one of the 13 root DNS servers (which I've talked about before), and ask them "Hey, do you know the A record for www.twitter.com?", and the answer back from the root DNS server is going to be "No, I don't know, but the name servers for com are over there". And that's all the root servers can do to help you; point you in the right direction for the top level domain name servers. You then ask one of the name servers for com (which happen to be run by Verisign) if they know the answer to your question, and they'll say no too, but they will know where the name servers for twitter.com are, so they'll point you in that direction, and you can try asking the twitter.com name servers for the A record for www.twitter.com, and they'll probably come back with the answer you were looking for the whole time.

So recursive DNS resolvers start at the 13 root servers, which can give you an answer for the very last piece of your query, and you work your way down the DNS hierarchy, moving right to left in the hostname, until you finally reach an authoritative DNS server with the answer you seek. And this is really an amazing system that is so scalable, because a resolver only critically depends on knowing the IP addresses for a few root servers, and can bootstrap itself into the practically infinitely large database that is DNS from there. Granted, needing to go back to the root servers to start every query would suck, so resolvers also have a local cache for answers, so once one query yields the NS records for "com" from a root, those will sit in the local cache for a while and save the trip back to the root servers for every following query for anything ending in ".com".
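
You can watch this whole walk happen with dig's +trace option, which starts at the root servers and follows the referrals down itself instead of leaning on your resolver's cache:

$ dig +trace www.twitter.com A

The output shows the root servers referring you to the com name servers, the com servers referring you to the twitter.com name servers, and finally the twitter.com servers answering the A query itself.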

But we didn't build a resolver. We built an authoritative DNS service, which sits down at the bottom of the hierarchy with a little database of answers (called a zone file), waiting for some higher level DNS server to refer a resolver to us for a query only we can answer.

Most people's experience with authoritative DNS is that when they register their website with Google Domains or GoDaddy, they have some kind of web interface where you can enter IP addresses for your web server, configure where your email for that domain gets sent, etc, and that's fine. They run authoritative DNS servers that you can use, but for various reasons, people may opt not to use their registrar's DNS servers, and instead just tell the registrar "I don't want to use your DNS service, but put these NS records in the top level domain zone file for me so people know where to look for their questions about my domain".

Primary vs Secondary Authoritative

So authoritative DNS servers locally store their own little chunk of the enormous DNS database, and every other DNS server which sends them a query either gets the answer out of the local database, or a referral to a lower level DNS server, or some kind of error message. Just what the network admin for a domain name is looking for!

But DNS is also clever in the fact that it has the concept of a primary authoritative server and secondary authoritative servers. From the resolver's perspective, they're all the same, but if you only had one DNS server with the zone file for your domain and it went offline, that's the end of anyone in the whole world being able to find any IP addresses or other resource records for anything in your domain, so you really probably want to recruit someone else to also host your zone to keep it available while you're having some technical difficulties. You could stand up a second server and make the same edits to your zone database on both, but that'd also be a pain, so DNS even supports the concept of an authoritative transfer (AXFR) which copies the database from one server to another to be available in multiple places.

This secondary service is the only thing that NS-Global provides; you create your zone file however you like, hosted on your own DNS server somewhere else, and then you allow us to use the DNS AXFR protocol to transfer your whole zone file into NS-Global, and we answer questions about your zone from the file if anyone asks us. Then you can go back to your DNS registrar and add us to your list of name servers for your zone, and the availability/performance of your zone improves by the fact that it's no longer depending on your single DNS server.
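
If you're curious whether your primary will actually hand out the zone, you can request a transfer by hand with dig before signing up anywhere (the zone and server names here are placeholders):

$ dig AXFR example.com @ns1.example.com

If transfers are allowed from your IP, the whole zone comes back as text; if not, dig reports a failed transfer and you'll need to loosen whatever controls zone transfers on your primary (allow-transfer in BIND, for example) so the secondary servers can pull it.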

The Performance of Anycast

Building a secondary DNS service where we run a server somewhere, and then act as a second DNS server for zones is already pretty cool, but not stupid enough to satisfy my usual readers, and this is where anycast comes in.

Anycast overcomes three problems with us running NS-Global as just one server in Fremont:
  1. The speed of light is a whole hell of a lot slower than we'd really like
  2. Our network in Fremont might go offline for some reason
  3. With enough users (or someone attacking our DNS server) we might get more DNS queries than we can answer from a single server

And anycast to the rescue! Anycast is where you have multiple servers world-wide all configured essentially the same, with the same local IP address, then use the Border Gateway Protocol to have them all announce that same IP address into the Internet routing table. By having all the servers announce the same IP address, to the rest of the Internet they all look like they're the same server, which just happens to have a lot of possible options for how to route to it, so every network on the Internet will choose the anycast server closest to them and route their NS-Global traffic to that server.

Looking at the map above, I used the RIPE Atlas sensor network (which I've also written about before) to measure the latency it takes to send a DNS query to a server in Fremont, California from various places around the world. Broadly speaking, you can see that sensors on the west coast are green (lower latency) in the 0-30ms range. As you move east across the North American continent, the latency gets progressively worse, and as soon as you need to send a query to Fremont from another continent, things get MUCH worse, with Europe seeing 100-200ms of latency, and places like Africa feeling even more pain with query times in the 200-400ms range.

And the wild part is that this is a function of the speed of light. DNS queries in Europe just take longer to make it to Fremont and back than a query from the United States.

But if we instead deploy a dozen servers worldwide, and have all of them locally store all the DNS answers to questions we might get, and have them all claim to be the same IP address, we can beat this speed of light problem.

Networks on the west coast can route to our Fremont or Salt Lake City servers, and that's fine, but east coast networks can decide that it's less work to route traffic to our Reston, Virginia or New York City locations, and their traffic doesn't need to travel all the way across the continent and back.

In Europe, they can route their queries to our servers in Frankfurt or London, and to them it seems like NS-Global is a Europe-based service, because there's no way they could have sent their DNS query to the US and gotten an answer back as soon as they did (because physics!). We even managed to get servers in São Paulo, Brazil; Johannesburg, South Africa; and Tokyo, Japan.

So now NS-Global is pretty well global, and in the aggregate we generally tend to beat the speed of light vs a single DNS server regardless of how well connected it is in one location. Since we're also using the same address and the BGP routing protocol in all of these locations, if one of our sites falls off the face of the Internet, it... really isn't a big deal. When a site goes offline, the BGP announcement from that location disappears, but there's a dozen other sites also announcing the same block, so maybe latency goes up a little bit, but the Internet just routes queries to the next closest NS-Global server and things carry on like nothing happened.

Is it always perfect? No. Tuning an anycast network is a little like balancing an egg on its end. It's possible, but not easy, and easy to perturb. Routing on the Internet unfortunately isn't based on the "shortest" path, but on how far away the destination is based on a few different metrics, so when we turn on a new location in somewhere like Tokyo, because our transit provider there happens to be well connected, suddenly half the traffic in Europe starts getting sent to Tokyo until we turn some knobs and try to make Frankfurt look more attractive to Europe than Tokyo again. The balancing act is that every time we add a new location, or even if we don't do anything and the topology of the rest of the Internet just changes, traffic might not get divided between sites like we'd really like it to, and we need to start reshuffling settings again.

Using NS-Global

Historically, to use NS-Global, you needed to know Javier and email him to add your zones. But that was a lot of work for Javier, answering emails, and typing, and stuff, so we decided to automate it! I created a sign-up form for NS-Global, so anyone hosting their own zone who wants to use us as a secondary service can just fill in their domain name and what IP addresses we should pull the zone from.

The particularly clever part (in my opinion) is that things like usernames or logging in or authentication with cookies seem like a lot of work, so I decided that NS-Global wouldn't have usernames or login screens. You just show up, enter your domain, and press submit.

But people on the Internet are assholes, and we can't trust the Internet people to just let us have nice things, so we still needed some way to authenticate users as being rightfully in control of a domain before we would add or update their zone in our database; and that's where the RNAME field in the Start of Authority record comes in. Every DNS zone starts with a special record called the SOA record, which includes a field which is an encoded email address of the zone admin. So when someone submits a new zone to NS-Global, we go off and look up the existing SOA for that zone, and send an email to the RNAME email address with a confirmation link to click if they agree with whomever filled out our form about a desire to use NS-Global. Done.
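
You can see the field we key off of by asking for a zone's SOA record yourself; the second name in the answer is the RNAME, with the first dot standing in for the @ (the zone and values below are illustrative, not real data):

$ dig +short SOA example.com
ns1.example.com. hostmaster.example.com. 2023110101 7200 3600 1209600 3600

Here hostmaster.example.com. decodes to hostmaster@example.com, which is where our confirmation email goes.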

This blog post is getting a little long, so I will probably leave other details like how we push new zones out to all the identical anycast servers in a few seconds or what BGP traffic engineering looks like to their own posts in the future. Until then, if you run your own DNS server and want to add a second name server as a backup, feel free to sign up at our website!

Validating BGP Prefixes using RPKI on Arista Switches

16 April 2020 at 15:00
I've written another blog post for the Arista EOS blog. This time, it's on how to use the brand new EOS feature that supports RPKI via RTR. I walk through a simple example of how to enable Docker in EOS and spin up a Routinator instance on the switch's loopback interface to use RPKI to validate prefixes from BGP peers.

Unfortunately, Arista does require that you create a free EOS Central account to read blog posts...

A Simple Quality of Service Design Example

1 April 2020 at 16:00

I've written another blog post for the Arista corporate blog! This time, it's on the basics of how to turn on the Quality of Service feature in EOS and start picking out various classes of traffic to treat them better/worse than normal.

EOS Central does require that you create a free login to read blog articles, unfortunately. I'm a little biased, but I'd suggest it might be worth the effort!

Booting Linux Over HTTP

29 March 2020 at 13:00
A couple years ago, one of my friends gave me a big pile of little Dell FX160 thin clients, which are cute little computers with low-power Atom 230 processors and support for up to 3GB of RAM. Being thin clients means they were originally meant to be diskless nodes that could boot a remote desktop application to essentially act as remote graphical consoles for applications running on a beefier server somewhere else.

That being said, they're great as low power Linux boxes, and I've been deploying them in various projects over the years when I need a Linux box somewhere but want/need something a little more substantial than a Raspberry Pi.
The one big problem with them is that they didn't come with the 2.5" hard disk bracket, so I needed to source those drive sled kits on eBay to add more storage than the 1GB embedded SATA drive they all came with. Which is nominally fine; I bought a few of the kits for about $10 a piece, and being able to deploy a 1TB 2.5" drive somewhere with that as the only added expense has been handy a few times.

But it always left me thinking about what I could do with the original 1GB drive in these things. Obviously, with enough effort and hand wringing, you can get Linux installed on a 1G partition, but that feels like it's been done before, and these are thin clients! They're meant to depend on the network to boot!

Fast forward to this year, and thanks to one of their network engineers hearing my interview on On the Metal, I've been working with Gandi.net to help deploy one of their DNS anycast nodes in Fremont as part of the Fremont Cabal Internet Exchange. The thing is, how they designed their anycast DNS nodes is awesome! They have a 10,000 foot view blog post about it, but the tl;dr is that they don't deploy their remote DNS nodes with an OS image on them. Each server gets deployed with a USB key plugged into it with a custom build of iPXE, which gives the server enough smarts to download the OS image from their central servers over authenticated HTTPS and run the service entirely from RAM.

Operationally, this is awesome because it means that when they want to update software on one of their anycast nodes, they can build the new image in advance on their provisioning server centrally, and just tell the server to reboot. When it reboots, it automatically downloads the new image from the provisioning servers, and you're up to date. If something goes terribly wrong and the OS on a node becomes unresponsive? Open a remote hands ticket with the data center "please power cycle our server" and the iPXE ROM will once again download a fresh copy of the OS image to run in RAM.

Granted, they've got all sorts of awesome extra engineering involved in their system; cryptographic authentication of their boot images, local SSDs so that while the OS is stateless, their nodes don't need to perform an entire DNS zone transfer from scratch every time they reboot, etc, etc. Which is all well and good, but this iPXE netbooting of an entire OS image over the wide Internet using HTTP is just the kind-of-silly, kind-of-awesome sort of project I've been looking to do with these thin clients I've got sitting around in my apartment.

Understanding The Boot Process

This left me with a few problems:

  1. The Gandi blog post regarding their DNS system was a 10,000 foot view conceptual overview, so they rightfully glossed over some of the technical specifics that weren't important to their blog post's message but are really important for actually making it work.
  2. I have been blissfully ignorant up until now of most of the mechanics involved with Linux booting in the gap between "the BIOS runs the bootloader" and "the Linux kernel is running with your init daemon as PID 1 and your fstab mounted".
  3. I'm trying to do something exceedingly weird here, where there are no additional file systems to mount while the system is booting. There's plenty of guides available on booting Linux with an NFS or iSCSI root file system, but I'm looking at even less than that; I want the entire system just running from local RAM.
So before talking about what I ended up with, let's talk about the journey and what I had to learn about the boot process on Linux.

On a typical traditional Linux host, when you power it on, the local BIOS has enough smarts to find local disks with boot sectors, and read that first sector from the disk and execute it in RAM. That small piece of machine code then has enough smarts to load a more sophisticated bootloader like GRUB from somewhere close on the disk, which then has enough smarts to do more complicated things like load a Linux kernel and init RAM disk to boot Linux, or give the user a user interface to select which Linux kernel to boot, etc. One of the reasons why many Unix systems had a separate /boot partition was because this chainloader between the BIOS and the full running kernel couldn't mount more sophisticated file systems so needed a smaller and simpler partition for just the bare minimum boot files needed to get the kernel running.

The kernel file plus init RAM disk (often called initrd) are the two files Linux really requires to boot, and the part where my understanding was lacking. Granted, my understanding is still pretty lacking, but the main insight I gained was that the initrd file is a packed SVR4 archive of the bare minimum of files that the Linux kernel needs to then go and mount the real root file system and switch to it to have a fully running system. These SVR4 archives can be created using the "cpio" command as the "newc" file format, and the Linux kernel is smart enough to decompress it using gzip before mounting the archive, so we can gzip the initrd file to save bandwidth when ultimately booting the system.

(Related aside; there's many different pathways from the BIOS to having the kernel and initrd files in RAM. One of the most popular "net booting" processes, which I have used quite a bit in the past, is PXE booting, where the BIOS boot ROM in the network card itself has juuuust enough smarts to send out a DHCP request for a lease which includes a TFTP server and file path for a file on that TFTP server as DHCP options, and the PXE ROM downloads this file and runs it. This file is usually pxelinux.0, which I think is another chainloader which then downloads the kernel and initrd files from the same TFTP server, and you're off to the races.)

The missing piece for me inside the initrd file is that the kernel immediately runs a shell script in the root of the filesystem named "/init". This shell script is what switches the root file system over to whatever you specified in your /etc/fstab file, and ultimately at the very end of the /init script is where it "exec /sbin/init" to replace itself with the regular init daemon which you're used to being PID 1 and being the parent of every other process on the system.

I had never seen this /init script before, which is understandable because it's normally not included in your actual "/" root file system! It's only included in the initrd archive's "/" file system (which you can actually unpack yourself using gunzip and cpio), and disappears when it remounts the actual root and exec's /sbin/init... So since I want to run Linux entirely from RAM, "all" I need to do is figure out how to create my own initrd file, generate one that is not a bare minimum to mount another file system but everything I need to run my application in Linux, and figure out a simpler /init script to package with it which doesn't need to mount any local volumes but only needs to mount all the required virtual file systems (like /proc, /sys, and /dev) and exec the real /sbin/init to start the rest of the system.
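
If you want to see this layout for yourself, you can unpack the initrd your distro already ships; a quick sketch, assuming a gzip-compressed image like Debian's (paths vary by distro, and some distros use other compressors):

$ mkdir /tmp/initrd-unpacked && cd /tmp/initrd-unpacked
$ zcat /boot/initrd.img-$(uname -r) | cpio -idmv
$ less init

The init script sitting at the root of the unpacked tree is exactly the /init described above.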

Generating My Own Initrd File

So the first step in this puzzle for me is figuring out how to generate my own initrd file including the ENTIRE contents of a Linux install instead of just the bare minimum to get it started. And to generate that initrd archive, I first need to create a minimal root file system that I can configure to do what I want to then pack as the initrd file we'll be booting.

Thankfully, Debian has some really good documentation on using their debootstrap tool to start with an empty folder on your computer and end up with a minimal system. The first section of that documentation talks about partitioning the disk you're installing Debian on, but we just need the file system, so I skipped that part and went straight to running debootstrap in an empty directory.

$ sudo debootstrap buster /home/kenneth/tmp/rootfs http://ftp.us.debian.org/debian/

Remember that there's plenty of Debian mirrors, so feel free to pick a closer one off their list.

Once debootstrap is done building the basic image, from a terminal we can jump into the new Linux system using chroot, which doesn't really boot this system, but drops the terminal into it as if it were the root of the currently running system, so you can interact with it like it's running. This lets us edit config files like /etc/network/interfaces, apt install needed packages, etc etc. Pretty much just follow the rest of the Debian debootstrap guide and then do the configuration work needed to set up whatever the system should actually be doing (things like setting a root password, installing ssh, configuring network interfaces, etc etc).

$ LANG=C.UTF-8 sudo chroot /home/kenneth/tmp/rootfs /bin/bash

Since we're not installing this system on an actual disk, we don't need to worry about installing the GRUB or LILO bootloader like the guide says, but I did install the Linux kernel package since it was the easiest way to grab a built Linux kernel to pair with the final initrd file we're creating. Apt install linux-image-amd64 and copy that vmlinuz file out of the .../boot/ directory in the new filesystem to somewhere handy.
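
To make the "configure it inside the chroot" step concrete, it's roughly this sort of thing; the package list and files are examples rather than a required set, and the # prompts are the root shell inside the chroot, not comments:

# apt update
# apt install linux-image-amd64 openssh-server nano
# passwd
# nano /etc/network/interfaces
# nano /etc/hostname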

The next step is to place the much simpler /init script in this new file system, so when the kernel loads this entire folder as its initrd we don't go off and try and mount other file systems or anything. This is the part where my friend at Gandi.net was SUPER helpful, since trying to figure out each of the various virtual file systems that still need to be mounted on my own only yielded me a lot of kernel panics.

So huge thanks to Arthur for giving me this chunk of shell code! Copy it into the root of the freshly debootstrapped system and mark it executable (chmod +x)

Source for init:
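
Arthur's script itself isn't reproduced here, but a minimal /init along the lines described above, as a rough sketch rather than the real thing, looks roughly like:

#!/bin/sh
# Minimal initramfs /init for a system running entirely from RAM:
# mount the kernel's virtual filesystems, then hand off to the real init.
mount -t proc     proc     /proc
mount -t sysfs    sysfs    /sys
mount -t devtmpfs devtmpfs /dev
# No real root to mount or switch to; just replace this script (PID 1)
# with the distro's init, which on Debian is systemd.
exec /sbin/init

The real script handles a few more virtual filesystems and edge cases, which is exactly the part that took Arthur's help to get right.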


At this point, we're ready to pack this filesystem into an initrd archive and give it a shot. To create the archive, I followed this guide, which boils down to passing cpio a list of all the file names, and then piping the output of cpio to gzip to compress the image.

$ cd /home/kenneth/tmp/rootfs
$ sudo find . | sudo cpio -H newc -o | gzip -9 -n >~/www/initrd

At this point, you should have this initrd file which is a few hundred MB compressed, and the vmlinuz file (vmlinuz being a compressed version of the usual vmlinux kernel file!) which you grabbed out of the /boot directory, and that *should* be everything you need for booting Linux on its own. Place both of those files on a handy HTTP server to be downloaded by the client later.

Netbooting This Linux Image

Given the initrd and kernel images, the next step is to somehow get the target system to actually load and boot these files. Aside from what I'm talking about here of using HTTP, you can use any of the more traditional booting methods like putting these files on some local storage media and installing GRUB, or using the PXE boot ROM in your computer's network interface to download these files from a TFTP server, etc.

TFTP would probably be pretty cute since many computers can support it stock, but that depends on your target system being on a subnet with a DHCP server that can hand out the right DHCP options to tell it where to look for the TFTP server. I didn't want to depend on DHCP, and I wanted to use HTTP, so I instead opted to use iPXE, which is a much more sophisticated boot ROM than the typical PXE ROMs you get.

It is possible to directly install iPXE on the firmware flash of NICs, but that's often challenging and hardware specific, and a good point Arthur made was that since they boot iPXE from USB, if for some reason they need to swap the iPXE image remotely, it's *MUCH* easier to mail a USB flash drive and ask someone to replace it than to try and walk someone through reflashing the firmware on a NIC over the phone... I'm not going to be using a USB drive, since these thin clients happen to have convenient 1GB SSDs in them already, but it's the same image. Instead of dd'ing the ipxe.usb image onto a flash drive, I just temporarily booted Linux on the thin clients and dd'ed the iPXE ROM onto the internal /dev/sda.

The stock iPXE image is pretty generic, and like a normal PXE ROM sends out a DHCP request for a local network boot image to download. This isn't what we want here, so we're definitely going to need to build our own iPXE binary in the end, but I started with the stock ROM because it allows you to hit control-B during the boot process and interactively poke at the iPXE command line, and manually step through the entire process of configuring the network, downloading the Linux kernel, downloading the initrd file, and booting them.

So before building my own custom ROM, I burned iPXE onto a USB flash drive and poked at the iPXE console with the following commands on my apartment network:

dhcp
kernel http://example.com/vmlinux1
initrd http://example.com/init1
boot

And that was enough to start iterating on my initrd file to get it to what I wanted. Since I was still doing this in my apartment which has a DHCP server, I was able to ask iPXE to automatically configure the network with the "dhcp" command, then download a kernel and initrd file, and then finally boot with the two files it just downloaded.

So at this point, I was able to boot the built Linux image interactively from the iPXE console, and had a fully running Linux system in RAM, which was kind of awesome, but I wanted to fully automate the iPXE booting process, which means I need to build a custom image with an embedded "iPXE script" which is essentially just a list of commands for iPXE to run to configure the network interface, download the boot files, and boot.

iPXE Boot Script:
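
The embedded boot script is essentially the same four commands from the interactive session above, plus a header line so iPXE recognizes it as a script (the URLs are placeholders for wherever the kernel and initrd are hosted; a static network setup would swap the dhcp line for iPXE's set net0/... commands):

#!ipxe
dhcp
kernel http://example.com/vmlinux1
initrd http://example.com/init1
boot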

So given that script, we follow the iPXE instructions to download their source using git, install their build dependencies (which I apparently already had on my system from past projects, so good luck...), and the key step is that when performing the final build, we pass make the path to our iPXE boot script file to embed it in the image as what to run.

$ cd ~/src/ipxe/src
$ make EMBEDDED_IMAGE=./bootscript bin/ipxe.usb

And at this point in the ipxe/src/bin folder is the built image of ipxe.usb which has our custom boot script embedded in it! Since the internal SATA disk is close enough to a USB drive, from a booting perspective, that's the variant of ROM I'm using.

So given this custom iPXE ROM, I manually booted a live Linux image on the thin client, used dd to write the ROM to /dev/sda which is the internal 1G SSD, and the box is ready to go!

Now, when I power on the box, the BIOS sees that the internal 1G SSD is bootable, so it boots that, which is iPXE, which runs the embedded script we handed it, which configures the network interface, downloads our custom initrd file and the Linux kernel from my HTTP server, and boots those. Linux then unpacks our initrd file and runs the /init script embedded in it, which just mounts the virtual file systems like /proc, /sys, and /dev, doesn't try to mount any other local file system, and finally exec's /sbin/init, which in the case of Debian happens to be systemd, and we've got a fully running system in RAM!

Video of generally what that looks like:


So once again, thanks to Arthur from Gandi.net for the original idea and gentle nudges in the right direction when I got stuck.

Of course, the next thing to do is start playing "disk space golf" with the OS image to see how small I can make the initrd file, since the smaller the initrd file, the more RAM that is left over for running the application in the end! And actually doing something useful with one of these boxes running iPXE... a topic for another blog post.

Update: One thing to note is that this documentation is for the minimum viable "booting Linux over HTTP". iPXE does support crypto such as HTTPS, client TLS certificates for client authentication, and code signing. More details can be found in their documentation.

Building My Own 50Ah LiFePO4 Lithium Battery Pack

19 March 2020 at 01:00
Several years ago, I had purchased a 20Ah 12V Lithium Iron battery pack from Bioenno for my various 12VDC projects. To help protect it, I ultimately built it up into a 50cal ammo can with a dual panel-mount PowerPole connector on the outside, which has proven really nice as far as battery boxes go:

  • 20Ah is a decent battery capacity for a small load
  • The packaged Bioenno pack left some space inside the box to also store the charger it came with, some PowerPole accessories, etc.
  • The fact that you're able to close up the box and use the power connectors on the outside once you're using it is real nice.
  • Bioenno battery packs are well packaged and fairly idiot proof since the battery management/protection board is built into the pack so it's hard to really harm these.
So I really liked that project; cut out the mounting hole for the panel mount, build a small harness to plug the battery pack into, and use hard foam to give the pack further protection inside the box.

I was recently hanging out with one of my electronics buddies while he was browsing eBay, and he found a bulk lot of eight raw 50 amp-hour LiFePO4 cells from a seller for about $400. A good deal, particularly because the seller said these had been purchased for a project which ended up being canceled, so these were practically new off the shelf cells!

The problem was, my friend was only interested in building a 4S 12V pack (4 series cells x 3.5V ~ kind of equals 12V), so since my unsuspecting ass was sitting there, and he was so persuasive, surprise! I now own four raw prismatic 50Ah lithium battery cells.

To give you a sense of scale, each of these cells is about the size of a paperback book, with two studs on the top for the positive and negative terminals. The eBay lot happened to come with the cell spacers and busbars, so all that was missing to build these into a full battery pack was a battery management system module and an enclosure.

What finally convinced me to buy into half of this lot was that, based on the measurements, it looked like I'd be able to exactly fit this homemade pack into another 50cal ammo can, so that solves the enclosure issue...

Well... almost. The cell spacers were literally millimeters too wide to fit in the can, so I did need to shave off the exterior dovetails to make them fit...

One lesson I learned: apparently when I bought these ammo cans years ago, I didn't notice that this one was missing several of the hinge pins for the lid. The two that were left were also loose enough that as I started drilling on the case, they fell out. So those had to get epoxied back in, and I now know to never buy ammo cans without checking to make sure all of this hardware is still there.

So let's talk about the battery management system for a second, because that's the important part to understand when building a lithium battery pack out of raw cells.

The LiFePO4 battery chemistry is awesome compared to the typical deep cycle lead-acid which I'd normally use for the same applications as this. Higher energy and power density, much higher cycle life, and lithium batteries tend to have a very flat discharge curve, so you get a useful ~13V out of a 4S pack right up until the moment it's entirely dead, where lead-acid has a long slow decline starting at 12.6V down to below 11V, so some of your DC equipment might start misbehaving even after you've only discharged half of a lead-acid pack's capacity.

The problem with LiFePO4 compared to lead-acid is that it is also nowhere near as forgiving or abuse tolerant. Granted, LiFePO4 is better than other lithium chemistries, but you still need to be much more worried about issues like over-discharging or over-charging lithium batteries. Abuse a LiFePO4 cell, and it's much more prone to experience permanent damage compared to a lead-acid battery.

The battery management system is an electrical module which has a small wire run to both terminals on every cell, so that it can monitor each cell's individual voltage, and even if just a single cell's voltage falls outside of the acceptable range, the BMS can disconnect the whole pack to prevent further damage. The 45A Daly BMS that I bought also supports balancing, so if any one cell has slightly more or slightly less capacity or resistance than the rest, instead of continuing to drift further and further out of sync with the rest of the pack with regards to their charge state, when this BMS sees a cell reach 3.5V before the others, it starts bleeding 30mA off the cell, to slowly pull it back into line with the rest of the pack.

To help understand why this is so important, remember that in something more traditional like a flooded lead-acid battery, when a cell reaches full charge, it can still pass current to charge other cells by splitting water into H2 and O2. This depletes your electrolyte, but it means the rest of your cells can still come up to full charge. Lithium chemistries don't have any sort of non-catastrophic current leakage like this, so the only way to get current past the highest voltage cell to the others is to use an active device to bleed the current for you.

It also means that LiFePO4 batteries don't have any concept of a higher voltage balancing charge stage like lead-acid batteries have, so as long as you fully charge the pack to the recommended 14.6V and hold it there long enough for the BMS to finish balancing the pack (which it shouldn't even need to do as long as the cells start well matched), the pack should be fine. Different charge profiles for different battery chemistries, and even different manufacturer recommendations within the same chemistry, are a whole rabbit hole in themselves, so for this article I'm going to just leave it at "only use a battery charger specifically designed for LiFePO4 to charge these" and gracefully move on.

All of the small wires running to each terminal came with my BMS pre-terminated on the small connector block which plugs into the BMS module. Since these are just for monitoring cell voltage and balancing cells, they don't need to carry much current. The two big 10AWG wires molded into the BMS are meant to go between the most negative cell terminal in the pack (blue wire) and the charger/load (black wire), so that the BMS is able to disconnect the pack from everything else if it decides that something is wrong.

These high current lines on the BMS were nice. Definitely 10AWG, "ultraflex" so there are more tiny strands, which helps with flexibility, and just enough length on them to be useful for what I wanted. The battery line got terminated with a crimp terminal and connected directly to the pack to try and limit resistance.

Since the PowerPole panel connectors I'm using can each accept 10AWG themselves, I wanted to split this single 10AWG wire from the BMS into two wires for the load/charger connections.

To do this, I used a trick I picked up from my training for the IPC 620 cable assemblies standard: when you're trying to butt solder wires together, use a fine gauge wire to lash the wires together first before trying to flow solder in.

All of the strength and conductivity comes from the solder, but thanks to the lashing you're not trying to hold burning hot wire and molten metal together while soldering.

Don't mind the blood on the finer gauge wire... That was because I was an idiot while lashing these wires together and forgot that I was working right next to a fully charged bank of raw battery cells with exposed busbars. The 30AWG wire happened to brush the other end of the pack while I was winding it, and even though it was only 12V, the lack of any current limiting meant the wire instantly turned into a hot knife in my hands and left me with a pretty nasty burn.

Anyways... two layers of heat shrink later, we've got an incredibly good electrical connection between the BMS and two more short pieces of 10AWG wire to come out to the panel connector.

I made a point of keeping all of the wires short, but deliberately put strain relief loops in the routing so the pack could have some ability to move around in the enclosure while being handled without stressing anything, since I'm building this battery pack for field use instead of as a static install.

Several pieces of hard foam insulate, cushion, and immobilize the pack inside the metal can. That doesn't leave much room in this can for DC power accessories like in my first power can, but I figure when I'm using a 50Ah pack instead of my 20Ah pack, I'm probably running a more significant load, so it would be worth putting some more planning into exactly what cabling I'll need (or just bring both and have all my accessories plus 70Ah of capacity).

Ultimately, I'm really happy with how this pack turned out, and it was only about $300 for all the raw components to build a pack which still has all of the protection and balancing features that I'd require in a portable field-use battery pack like the off-the-shelf one that I originally bought from Bioenno.

Building a Raspberry Pi Stratum 1 NTP Server

1 March 2020 at 21:00

After the nice reception my last article on NTP got, I decided that it was about time I pulled a project off the shelf from a few years ago and get a stratum 1 NTP server running on a Raspberry Pi again.

So this article is going to be my notes on how I built one using a cheap GPS module off eBay as the reference time source and a Raspberry Pi. (a 2B in this case, but hopefully the differences for newer models will be minor; if there are any, they're probably around the behavior of the serial port since I think things changed on the Pi3 with bluetooth support?)

So anyways, before we jump into it, let's spend some time talking about why you would and wouldn't want to do this:

Pro: NTP is awesome, and this lets you get a stratum 1 time server on your local network for much lower cost than traditional COTS NTP time standards, which tend to be in the thousands of dollars price range.

Con: This is a Raspberry Pi, booting off an SD card, with a loose PCB plugged into the top of it, so I wouldn't classify it as "a robust appliance". It isn't a packaged enterprise grade product, so I wouldn't rely on this NTP server on its own for anything too critical, but then again, I would say that about NTP in general; it's only meant to get clocks to within "pretty close" of real time. If you're chasing individual milliseconds, you should probably be using PTP (Precision Time Protocol) instead of NTP... Totally depends on what you're doing. I'm just being a nerd.

Pro: The pulse-per-second driver for the Raspberry Pi GPIO pins is pretty good, so once you get this working, the GPS receiver will set the Pi's clock extremely well!

Con: The Raspberry Pi's Ethernet interface is actually hanging off of a USB hub which is hanging off of the USB interface on the SoC (system on chip) that powers the Raspberry Pi, so there is an intrinsically higher level of latency/jitter on the Pi's Ethernet interface vs a "real" PCIe NIC. This means that your bottleneck for getting crazy good time sync on all of your computers is ironically going to be getting the time off of the Raspberry Pi onto the network...

Pro: I would expect this USB jitter to be on the order of 500 microseconds, which is still less than a millisecond, and remember what I said about chasing individual milliseconds in NTP? This should be fine for any reasonable user of NTP.

Conceptual Overview

So we are going to be building an NTP server, which sets its time off of a GPS receiver, which is setting its time off of the global constellation of GPS (et al) satellites, which have cesium atomic clocks on-board which the US government puts a lot of effort into setting correctly, so that is where our time is actually coming from.

I'm specifically using a u-blox NEO-6M GPS receiver, which I got on eBay for a few dollars on a breakout board. This module isn't the "Precision timing" specific NEO-6T variant, but the 6M module is still completely sufficient for the level of time accuracy that we're looking for, and much cheaper/more available than the 6T module which is specifically designed for this sort of static time-keeping application (to the extent where the 6T will even lose GPS lock if you start moving it!)

The NEO-6M GPS receiver, like many GPS receiver modules, has a USB interface, which we will not be using, a UART serial interface, which we will be using, and a "Pulse-per-second" PPS output that it asserts high at the very top of every second, which we will definitely be using!

The reason we will be using both the UART serial port and the PPS output on the GPS is that the UART tells us the current date, wall time, etc, but it does so over a 9600 baud serial port. The latency between when the module starts sending that report and when we receive all of it depends on how long the message is, which in turn depends on things like which NMEA sentences you have turned on, how many satellites the receiver sees, etc. So while it usually takes about 130ms to receive this report, you can't be sure exactly how long it took.

The PPS output, on the other hand, is extremely precise on telling us exactly when the top of the second is by taking a single pin and asserting it high right on the top of each second, but all it tells us is when the top of the second is, not which second, or hour, or even what day it is! We want to know both what second we're currently on in history, and exactly when that second starts, so we need both the UART and PPS inputs to correctly set the time.

Looking at conceptually how we're going to use the UART and PPS outputs of the module, we're going to feed both of them into the Raspberry Pi's GPIO header (more on this in the hardware section), and do the following:

  • We're going to use the gpsd daemon to parse the /dev/ttyAMA0 UART text stream, which is usually NMEA sentences, but gpsd might put the receiver into its proprietary binary mode if it can figure out what kind of GPS receiver it is. The gpsd developers have put a lot of work into writing a GPS data stream parser, so we want to benefit from it.
  • Gpsd will then write the time information it gets from the UART to a shared segment of memory, which the NTP daemon is able to connect to using its shared memory segment driver, to get a rough idea of what date/time it is.
  • The PPS output of the receiver will be fed into the Linux kernel PPS driver, which watches for the positive edges on the pin and time stamps them relative to the local system clock. The Linux kernel makes these events available for other applications via the /dev/pps0 device, which NTP will read using the NTP PPS Clock Discipline driver.

Required Hardware

For this project, I used a generic GPS breakout board off eBay, which I soldered to a piece of 4x6cm perf prototype board, and soldered wires to make the required connections. That being said, there are several Raspberry Pi GPS "hats" available off the shelf, and it will likely be much easier for you to just buy the one part from someone like AdaFruit instead of trying to build your own and chase both hardware issues and software issues at the same time when trying to get this working. But hey, do what you want; I'm not the boss of you. I just built my own because I happened to have all of these parts on hand already.
The one key difference depending on which GPS hat you use or how you build your own, is which of the many GPIO pins the GPS PPS pin is attached to. Popular PPS pins are GPIO4 (pin 7) and GPIO18 (pin 12), but I suspect that there's no reason most of the GPIO pins can't support the PPS input, so if you're building your own board, you can pick a different GPIO pin all together. The only thing you need to make sure is that the physical GPIO pin that you use matches the GPIO pin that you configure in the /boot/config.txt file when you're enabling the PPS overlay driver. I used GPIO18 on my build.

Nevermind the LEDs on the side of the perf board; those were left over from the last project I did on this perf board, and I figured they might come in handy if I wanted to program some status indicators for this NTP server.

The Raspberry Pi UART is tied to pins 8 and 10 on the header, so the UART should work regardless of what HAT you buy. What might vary is the exact baud rate. My modules happened to run at 9600 baud, but I've worked with GPS receivers that by default use 4800, 19200, or even 115200. Check the documentation for your receiver, and GPSD does a good job of trying to figure these things out on its own as well so it might not even matter.

So to summarize:

  • 5V on Pi header to Vcc on GPS module
  • GND on Pi header to GND on GPS module
  • TX on Pi header to RX on GPS module
  • RX on Pi header to TX on GPS module
  • GPIO18 on Pi header to PPS on GPS module

Initial Raspberry Pi Setup

I'm going to mostly gloss through this part, because setting up Raspbian on a Raspberry Pi is a pretty well documented process which greatly depends on what OS you're trying to do it on, and if you're trying to use something like NOOBS or write a raw IMG file directly onto an SD card.

I will just say that my preferred method of creating a new Raspbian SD card from my Linux desktop is using the "dd" command, so assuming that my SD card reader came up as /dev/sdf (check the last few lines of "dmesg" output to confirm) I do the following:


kenneth@thor:~/data/operatingSystems/rasberrypi$ sudo umount /dev/sdf*
umount: /dev/sdf: not mounted
kenneth@thor:~/data/operatingSystems/rasberrypi$ sudo dd if=./2019-07-10-raspbian-buster-lite.img of=/dev/sdf
4292608+0 records in
4292608+0 records out
2197815296 bytes (2.2 GB, 2.0 GiB) copied, 843.348 s, 2.6 MB/s
kenneth@thor:~/data/operatingSystems/rasberrypi$ sync

I should probably download a more recent image of Raspbian Lite... It all sorts itself out when I eventually run "sudo apt update; sudo apt upgrade -y".

Once you've created a Raspbian SD card, put it in your Pi, and boot it. I was having some difficulty getting the Pi to boot initially with the GPS hat plugged in (likely since by default the serial port is configured as a login console, and Linux didn't like the GPS receiver spewing a bunch of NMEA sentences into it), so I left the GPS unplugged until I got a chance to disable the login console on the serial port and reboot the Pi.

Install SD card in Pi, plug in monitor, keyboard, Ethernet, and power, and watch Raspbian automatically grow the filesystem to fill the card, and boot. Login with the default "user: pi" "password: raspberry" and do my usual new Raspbian host setup using the raspi-config tool (most of which probably isn't critical to getting NTP working):

From the command line, run "sudo raspi-config" to open the Raspbian config tool. Up/down arrows to navigate, enter to select, right arrow to jump down to the select/finish options.

1 Change User Password -- Pick a new password so you can later enable SSH
2 Network Options - N1 Hostname -- Pick a meaningful hostname for the Pi, like... ntp1pi... or something...
3 Boot Options - B2 Wait for Network at Boot -- I usually turn this off for all my projects, but it likely doesn't matter for a network time server...
4 Localisation Options - I1 Change Locale -- I scroll down and turn off en_GB.UTF-8 and turn on en_US.UTF-8 with the space bar. Page up/down are your friend on this dialog. I pick C.UTF-8 as my default locale. This change takes some time while Raspbian generates the new locale.
4 Localisation Options - I2 Change Timezone -- I go in and pick my local timezone.
4 Localisation Options - I3 Change Keyboard Layout -- This one is actually pretty important! Mainly because the default GB keyboard layout moves around a few keys like the | and @ I think? The one issue I have is that if I try and open this menu now, it displays as gibberish, so I need to reboot the Pi again before making this change.
4 Localisation Options - I4 Change Wi-Fi Country -- Doesn't matter on the Pi2, but on Pi3 and above, I set the WiFi to us.
5 Interfacing Options - P2 SSH -- I enable SSH, because this initial set up is both the first and the last time I plan to have a monitor and keyboard plugged into the Pi. I'll log in remotely over SSH going forward.
5 Interfacing Options - P6 Serial -- This is the important setting to change in here for this project! You would not like a login shell to be accessible over serial (answer no), but you would still like the serial port hardware to be enabled (answer yes).
7 Advanced Options - A3 Memory Split -- Since I'm never going to run a GUI on this Pi, or even ever plan to plug a monitor into it, I like to turn the GPU RAM down to the minimum of 16MB to free up a little more RAM for Linux itself.

When you right arrow to finally select "finish", it will probably prompt you to reboot, and you probably should, particularly because you need to do so to come back in to raspi-config a second time to change the keymap.

SUMMARY: So this was a bunch of my usual quality of life settings on new Raspbian images, and the one crucial step of turning off the login shell on the serial port, but leaving the serial port itself enabled.

Configuring GPSD

At this point, plug in the GPS hat.

The GPS receiver will take some time to gain a lock and start reporting valid location/time information (ranging from 15 seconds to 20 minutes, depending on how recently the receiver last had a fix, how good its sky view is, etc), but most receivers will still spew out NMEA sentences, even when they don't have a lock.

To perform our first sanity check, I installed minicom (sudo apt install minicom) and used it to open the /dev/ttyAMA0 serial port at 9600 baud to confirm that I had the GPS wired up correctly and was getting at least something that looked like this:

$GPRMC,064605.00,V,,,,,,,,,,N*7C
$GPVTG,,,,,,,,,N*30
$GPGGA,064605.00,,,,,0,00,99.99,,,,,,*67
$GPGSA,A,1,,,,,,,,,,,,,99.99,99.99,99.99*30
$GPGSV,1,1,01,29,,,35*75
$GPGLL,,,,,064605.00,V,N*4B
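
For reference, the minicom invocation for that check looks something like this (the -o flag skips sending modem init strings, which you don't want going to a GPS; Ctrl-A then X exits):

pi@ntp1pi:~ $ sudo minicom -b 9600 -o -D /dev/ttyAMA0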

At this point, we can install gpsd and configure it to handle the /dev/ttyAMA0 serial port!

pi@ntp1pi:~ $ sudo apt install gpsd

The GPSD daemon doesn't have a configuration file, but you pass it a few options via the /etc/default/gpsd file to tell it what you want.

pi@ntp1pi:~ $ sudo nano /etc/default/gpsd

We need to change two things in this file:Β 
  1. Add "/dev/ttyAMA0" to the empty DEVICES list
  2. Add the "-n" flag to the GPSD_OPTIONS field so GPSD will always try and keep the GPS receiver running. When gpsd is running on something like a phone, it makes sense to try and minimize when the GPS is running, but we don't care. I think GPSD also doesn't count NTP as a client, since NTPd is using the shared memory segment to talk to GPSD, so GPSD will randomly just stop listening to the receiver if you don't add this flag.
This is what my /etc/default/gpsd file looks like in the end:
# Start the gpsd daemon automatically at boot time
START_DAEMON="true"
# Use USB hotplugging to add new USB devices automatically to the daemon
USBAUTO="true"
# Devices gpsd should collect to at boot time.
# They need to be read/writeable, either by user gpsd or the group dialout.
DEVICES="/dev/ttyAMA0"
# Other options you want to pass to gpsd
GPSD_OPTIONS="-n"

To apply these changes, restart GPSD:

pi@ntp1pi:~ $ sudo service gpsd restart

To perform a sanity check that this configuration is working, we can install the gpsd clients (sudo apt install gpsd-clients) and use one of them like "gpsmon" or "cgps" to see if GPSD is even talking to the receiver, and hopefully reading good time/location information if you've left it running long enough already. You can also run "sudo service gpsd status" and should see a green dot that indicates that systemd has successfully started gpsd. (q quits out of that display)
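
Concretely, those sanity checks are just the following (gpsmon with no arguments attaches to the running gpsd rather than to the raw serial port):

pi@ntp1pi:~ $ sudo apt install gpsd-clients
pi@ntp1pi:~ $ gpsmon
pi@ntp1pi:~ $ sudo service gpsd status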

Configuring the PPS Device

The next step is to tell the Linux kernel which GPIO pin we've attached the PPS output from the GPS receiver to, so Linux can capture the pulse events and timestamp them. This is done by editing the "/boot/config.txt" file, and adding a new line with the driver overlay for pps-gpio (I usually add it at the end of the file), and specifying which GPIO pin you used at the end of the line (in my case, 18):

dtoverlay=pps-gpio,gpiopin=18

At this point, you will have to reboot again to apply this change to the /boot/config.txt file.

Once the Pi has rebooted, you should see a new /dev/pps0 device, and can test it using the ppstest tool (sudo apt install pps-tools):

pi@ntp1pi:~ $ sudo ppstest /dev/pps0
trying PPS source "/dev/pps0"
found PPS source "/dev/pps0"
ok, found 1 source(s), now start fetching data...
source 0 - assert 1583046422.047708052, sequence: 66 - clear  0.000000000, sequence: 0
source 0 - assert 1583046423.047205387, sequence: 67 - clear  0.000000000, sequence: 0
source 0 - assert 1583046424.046730740, sequence: 68 - clear  0.000000000, sequence: 0
source 0 - assert 1583046425.046282831, sequence: 69 - clear  0.000000000, sequence: 0
source 0 - assert 1583046426.045858285, sequence: 70 - clear  0.000000000, sequence: 0

If you instead see error messages about timeouts, it's possible the GPS receiver hasn't gained GPS lock yet, since many won't output PPS until they have a lock, or you've got a problem with your connections between the GPS and your selected GPIO pin...

One thing to note about the output of ppstest is that it timestamps each PPS event with the local time down to the nanosecond resolution. If you notice in the output above, each pulse seems to be happening 500us sooner than the pulse before, which shows that the local system clock speed is grossly off, since pulse per second events should be happening... once per second... not once per every 0.9995 seconds! Once you get ntpd running and disciplining the local clock off of this PPS, you should see the assert timestamp get close to all zeros after the decimal point, which means that your system clock is well aligned to the actual top of the second. So once we're all done here, it should look something like this:

pi@ntp1pi:~ $ sudo ppstest /dev/pps0
trying PPS source "/dev/pps0"
found PPS source "/dev/pps0"
ok, found 1 source(s), now start fetching data...
source 0 - assert 1583098783.999996323, sequence: 46821 - clear  0.000000000, sequence: 0
source 0 - assert 1583098784.999995254, sequence: 46822 - clear  0.000000000, sequence: 0
source 0 - assert 1583098785.999995642, sequence: 46823 - clear  0.000000000, sequence: 0
source 0 - assert 1583098786.999994935, sequence: 46824 - clear  0.000000000, sequence: 0
source 0 - assert 1583098787.999994174, sequence: 46825 - clear  0.000000000, sequence: 0

Notice how the local timestamps for the PPS input are now extremely close to the top of each second, ticking along at almost exactly once per second!

Configuring NTP

At this point, we should have both GPSD and the kernel PPS driver pulling information from the GPS receiver, so now we need to install the NTP server and edit its config file to tell it to use both of these time sources!

pi@ntp1pi:~ $ sudo apt install ntp

One thing that is a little unusual about NTP is that for local time sources, it's about the only system I've ever seen that takes advantage of the fact that the IPv4 loopback address space is an entire /8 (127.0.0.0/8), so each different type of time source, and each instance of each time source, is actually defined by a different 127.127.x.x IP address!

Looking at the NTP documentation for time sources, the two that we are interested in are the PPS clock discipline (driver 22) and the shared memory driver (driver 28).

Since we are interested in using the 0th PPS device (/dev/pps0), the server address we want for the PPS clock source is 127.127.22.0. Likewise, for the SHM(0) driver, we want address 127.127.28.0. Notice how the third octet in both of these IPv4 addresses correspond to the number of the driver we're using, and the fourth octet is the instance of that driver that we're using (the 0th/first for both). For example, if we were monitoring multiple PPS devices for some reason, we could configure multiple server addresses: 127.127.22.0 for /dev/pps0, 127.127.22.1 for /dev/pps1, 127.127.22.2 for /dev/pps2, etc. For this blog post, we're just looking at one of each...

We also need to configure a few flags for each of these time sources, so the new chunk of text we're adding to /etc/ntp.conf should look like this (thanks to this blog post for this snippet):

# pps-gpio on /dev/pps0
server 127.127.22.0 minpoll 4 maxpoll 4
fudge 127.127.22.0 refid PPS
fudge 127.127.22.0 flag3 1  # enable kernel PLL/FLL clock discipline

# gpsd shared memory clock
server 127.127.28.0 minpoll 4 maxpoll 4 prefer  # PPS requires at least one preferred peer
fudge 127.127.28.0 refid GPS
fudge 127.127.28.0 time1 +0.130  # coarse offset due to the UART delay

The first half configures the pulse per second device:
  • The minpoll 4 maxpoll 4 options on the server line tell NTP to always poll the PPS device every 16 seconds (2 raised to the power of 4) instead of the default "start at 64 second intervals and back off to 1024 second intervals" that ntpd uses by default, since we're not sending queries to remote NTP servers here! We're just looking at events from a local piece of hardware.
  • The "fudge 127.127.22.0 refid PPS" line assigns a human readable identifier of ".PPS." to this time source. Again, if you were doing something squirrely like monitoring multiple PPS devices (i.e. "PPS1", "PPS2", etc), or just wanted to assigned a different name to this server than "PPS", you could change that here.
  • The "fudge 127.127.22.0 flag3 1" line enables the kernel Phase Locked Loop clock discipline... which is about all I can say about it... It sounds important!
  • Same thing for the "server 127.127.28.0 minpoll 4 maxpoll 4 prefer" line with regards to the minpoll/maxpoll options; query the shared memory driver every 16 seconds. The "prefer" option tells ntpd to prefer this time source, which according to the comment seems to be required, but I don't quite follow why, since this is a stratum 0 time source, so I'd expect ntpd to end up preferring it anyways.
  • Again, "fudge 127.127.28.0 refid GPS" is assigning a human readable refid to this time source, which in this case is ".GPS." to indicate that this is the NMEA data over the serial port, vs the PPS coming in over the GPIO pin.
  • The oddest line is probably "fudge 127.127.28.0 time1 +0.130" which adds a 130ms offset to the exact time reported from the UART. Remember how I said that the precise time from the UART tends to be pretty bad, since it takes a while to deliver the data over the serial port at 9600 baud, and the exact length of the message will vary second to second? This 130ms offset is a crude approximation of how long it takes to send the NMEA report on the second, so that this clock will at least not be grossly off from true time. You will still see a few ms offset, and plenty of jitter, but at least the offset won't be huge!
So given this chunk of configuration, we add that to /etc/ntp.conf. Granted, even though we're setting up a stratum 1 time server here, it will likely still be wise to leave some other NTP servers in the configuration, so in case our GPS receiver dies for some reason, our server can still get its time from another server and will simply increment its stratum from 1 to one more than whichever other server it has selected as its system peer. Why the stock Raspbian ntp.conf comes with four pools configured (the pool gets expanded into multiple servers, so I don't think you need four of them) is beyond me...

My /etc/ntp.conf file ends up looking like this!
# /etc/ntp.conf, configuration for ntpd; see ntp.conf(5) for help

driftfile /var/lib/ntp/ntp.drift
leapfile /usr/share/zoneinfo/leap-seconds.list
statistics loopstats peerstats clockstats
filegen loopstats file loopstats type day enable
filegen peerstats file peerstats type day enable
filegen clockstats file clockstats type day enable

pool 0.debian.pool.ntp.org

# pps-gpio on /dev/pps0
server 127.127.22.0 minpoll 4 maxpoll 4
fudge 127.127.22.0 refid PPS
fudge 127.127.22.0 flag3 1  # enable kernel PLL/FLL clock discipline

# gpsd shared memory clock
server 127.127.28.0 minpoll 4 maxpoll 4 prefer  # PPS requires at least one preferred peer
fudge 127.127.28.0 refid GPS
fudge 127.127.28.0 time1 +0.130  # coarse offset due to the UART delay

# By default, exchange time with everybody, but don't allow configuration.
restrict -4 default kod notrap nomodify nopeer noquery limited
restrict -6 default kod notrap nomodify nopeer noquery limited

# Local users may interrogate the ntp server more closely.
restrict 127.0.0.1
restrict ::1

# Needed for adding pool entries
restrict source notrap nomodify noquery

To apply this new configuration, you need to tell Linux to restart ntp:

pi@ntp1pi:~ $ sudo service ntp restart

Checking Our Work

So now the BIG question is if this all actually worked. I didn't see any signs of life right away, so I did try rebooting the Pi, which might be required.

The tool you can use to interrogate the local NTP server with regards to what peers it's monitoring and what offsets it has calculated is "ntpq".
Here you can see the output of the "ntpq -p" command. Notice how the reach for the SHM(0) remote is no longer zero! This might take a few minutes, and once the NTP server can reach the shared memory segment, it will wait a few more minutes before it starts also polling the PPS, so don't freak out if you don't start seeing that reach value incrementing as well. It seems to typically take my server about 5-10 minutes of just monitoring the SHM(0) source before it starts also querying the PPS(0) source. If after 15-30 minutes you still see a 0 reach for both local clocks, you should investigate all of the sanity checks we did above (is the pps event being seen by the kernel, is gpsd running and happy, etc)
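
For reference, ntpq -p output at that stage looks something like this (the numbers here are illustrative rather than copied from my server; the '*' marks the currently selected system peer):

     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 0.debian.pool.n .POOL.          16 p    -   64    0    0.000    0.000   0.000
+time.example.ne 192.0.2.10       2 u   23   64  377   11.234    0.812   0.421
 PPS(0)          .PPS.            0 l    -   16    0    0.000    0.000   0.000
*SHM(0)          .GPS.            0 l   12   16   37    0.000   -2.345   4.567
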
After allowing my NTP server to run for several hours, we can see how the PPS(0) offset and jitter have gone to essentially zero. The SHM(0) offset and jitter are to be expected, since like I said, that precise timing is based off how long it takes to read the data over the serial port.

And with that, we have a working NTP server! The last time I built one of these, I didn't have access to a cabinet in a datacenter, so I will need to play around with this and see how it performs under load, but that's another project on its own...

Introduction to Network Time Protocol

25 February 2020 at 23:00

I guess before I start this, I should mention that I am no longer funemployed, and now work for Arista! Arista primarily sells Ethernet switches, and after working with their products so much for the last few years for my Internet Exchange project, it was a pretty easy sell for them to convince me to come over and join their customer engineering team, so I now get to spend my day solving customer problems.

As part of that, I can now write blog posts for Arista's blog! Granted, you need a (free) EOS Central login to view it... but my first article covering the basics of the Network Time Protocol is live!

Teardown of a Sonoff Basic WiFi Relay

9 January 2020 at 21:00
Video:


The Sonoff Basic is a low cost IoT WiFi connected relay which you can use to switch things like lights or small fans. The Sonoff is particularly hackable, since it brings out all the pins needed to reflash the firmware and add remote buttons/indicators. Third party firmware images like Tasmota (https://tasmota.github.io/docs/) allow you to directly control these using MQTT or as a Hue device.

Sonoff Basics are available on Amazon (https://amzn.to/2uwKYea) as well as many other online retailers. Sonoff (https://sonoff.tech/) also has many other form factors of relays other than the Basic supported by Tasmota.

Building a Storage Shelf for All My Digital Storage

1 October 2019 at 17:00
I've been using a Synology DS1813+ NAS (modern equivalent is the DS1819) in my apartment for the last few years since one of my friends gave it to me. It's a little pricey, but if this one rolled over and died, I would definitely replace it with another; once you stop wanting to spend your time mucking with the infrastructure in your life, the "just works" experience of the Synology is pretty attractive.

Unfortunately, what hasn't been very attractive is the dusty corner it's been living in. I finally got tired of blowing the dust out of it regularly and at least stood it up on a milk crate, but this could be improved. At the same time, my apartment has been slowly collecting other spare hard drives on various horizontal surfaces, as leftovers from various different projects and the like.

So I decided this would make for another good woodworking project this summer: a storage shelf... for my storage media... it's a storage shelf.
Might I say, this turned out pretty gorgeous. It was made from about a third of a sheet of 3/4" AB grade plywood with a red oak veneer on it. All of the joints were fairly regular rabbet or dado joints cut to fit the mating piece of plywood. The left shelf was specifically sized to fit my UPS, which I do regret a little since I didn't consider how warm it would get in such a tight space, and the right two shelves were sized specifically to hold two rows of 3.5" hard drives, so the shelf can hold 28 cold drives, in addition to the eight drives running in the Synology itself. (For those wondering, I'm running it with 7x4TB drives in RAID6, and a 128GB SSD read cache drive in the 8th bay)
The finish was three coats of semi-gloss polyurethane, which took longer than the actual wood cutting since you need to let each coat dry and re-sand it again with something fine like 400 grit for the next coat, and I could only work on this the few days I dropped by my parent's house to work in the woodshop and also wasn't creating dust.

Over all, it ended up taking about six weeks to get it all the way from concept to finished. I was tweeting about it and my other various projects quite a bit on my Twitter, which has really been my outlet for a lot of these sorts of posts lately.

Organizing My Tubes of Integrated Circuits

24 September 2019 at 17:00
Like many electronics hobbyists, I've been slowly collecting tubes of integrated circuits and LED displays that come in plastic tubes. Sadly, this collection has literally spent the last ten years as a slowly growing pile of tubes sitting on the floor of my lab. One of those problems that really sneaks up on you, unfortunately.

I decided to try and get back into the woodworking hobby this summer, so I made a small set of vertical cubbies for my IC tubes and used it as an excuse to practice my dovetail joinery, since a storage rack made out of a cheap piece of whitewood 1x12 board from Home Depot is pretty low stakes.
The main body of the rack is all made from a piece of 1x12 pine, with dados cut on the inside before I put it together to accept two pieces of 1/4" plywood as dividers. For finish, I just used one coat of boiled linseed oil, which is pretty minimal, but enough considering the fact that this piece isn't meant to be at all attractive. I mean... the bar has been set awfully low for the last ten years at "random pile of debris"
I think it turned out pretty well for being my first attempt at dovetails. My biggest mistake was that since this board was too wide to fit in my jointer, I just went "eh; it's kind of flat enough..." -- It was not flat enough to accommodate fine joinery. I ended up chasing my tail quite a bit trying to get the dovetails to fit together, since every piece of board was slightly cupped one direction or the other.

But hey, less than perfect or not, this was a step forward in the never ending battle on entropy in my life, so big win!

Getting Android to Work in an IPv6-Only Network

21 September 2019 at 04:00
As previously covered, I recently set up NAT64 on my network so that I didn't need to assign IPv4 addresses to all of my subnets or hosts and they'd still have a usable connection to the Internet at large.

Unfortunately, functional NAT64 is only really useful when hosts can auto-configure themselves with IPv6 addresses, default routes, and DNS64 name servers to resolve with. Figuring out how to get this to work, and then figuring out the workarounds required to get Android to work, took some time...

What Was Intended


Since my network is running off a Cisco 6506, which is a relatively ancient Ethernet switch, it isn't running what you might call the latest and greatest version of an operating system, so it took some historical research to figure out how IPv6 host configuration was even supposed to work at the time IPv6 was implemented on the 6500:

  • A new host connects to the network, and sends out a neighbor discovery protocol/ICMPv6 router solicitation to discover the local IPv6 subnet address and the address of a default gateway to reach the rest of the Internet.
  • The router(s) respond with the local subnet and default gateway, but the ND protocol did not originally include any way to also configure DNS servers, so the router would set the "other-config" flag, which tells the ND client that there is additional information they need that they should query over stateless DHCPv6.
  • The client, now having a local IPv6 address, would then send out a stateless DHCPv6 query to get the addresses for some recursive DNS servers, which the router would answer.
  • The client would now have a self-configured SLAAC address, default gateway, and RDNSS (recursive DNS server), which together enable it to usefully interact with the rest of the Internet.
Great! How do you implement this in IOS?

First you need to define an IPv6 DHCP pool, which ironically isn't really an address pool at all, but just some DNS servers and a local domain name. Realistically, it could be a pool, since IOS does support prefix delegation, but we're not using that here, so we just define what DNS server addresses to use and a local domain name if we felt like it:

ipv6 dhcp pool ipv6dns
 dns-server 2001:4860:4860::6464
 dns-server 2001:4860:4860::64
 domain-name lan.example.com

Since this pool doesn't even really have any state to it, other than maybe defining a different domain-name per subnet, you can reuse the same pool on every subnet that you want DHCPv6 service on, which is what I'm doing on my router, since the domain-name doesn't really make any difference:

interface Vlan43
 description IPv6 only subnet
 no ip address
 ipv6 address 2001:DB8:0:1::1/64
 ipv6 nd other-config-flag
 ipv6 dhcp server ipv6dns

IOS sends out router advertisements by default, so to enable handing out addresses and default gateways we just need to not disable that, but the "ipv6 nd other-config-flag" option is what sets the bit in the router advertisements to tell clients to come back over DHCPv6 to also ask for the DNS server addresses.

Now, before I outline the pros and cons for this design, I will disclaim that this is my perspective on the issue, so I'm not speaking from a place of authority, having not yet graduated elementary school when all of this originally was designed... This back and forth of ND - DHCPv6 does have some upsides:
  • Since both the ND and DHCPv6 queries are "stateless", all that the router/DHCP server is doing is handing out information it knows the answers to, and isn't adding any per-device information into any sort of database like a DHCP lease table, so a single router could now conceivably assign addresses to a metric TON of devices.
  • The separation of the DNS configuration from the IPv6 addressing configuration preserves the elegant separation of concerns that protocol designers like to see because it makes them feel more righteous.
There are also some pretty serious downsides:
  • Instead of just a DHCP server, you now need both a correctly configured router advertisement for the L3 addressing information and a correctly configured DHCPv6 server to hand out DNS information.
  • I still don't understand how hostname DNS resolution is supposed to work for this. In IPv4 land, you use something like dnsmasq which both hands out the DHCP leases and then resolves the hostnames back to those same IP addresses. Since all of this host configuration in IPv6 is stateless, by design DNS can't work... Maybe the presumption was that dynamic DNS wouldn't turn out to be a still-born feature?
  • The Android project, for reasons which defy understanding, refuses to implement DHCPv6.
That last point is a hugely serious issue for my network, since without DHCPv6, there is no mechanism for my Cisco 6506 to communicate to my phone what DNS servers to use over IPv6. My phone gets an IPv6 address and default gateway from my Cisco's router advertisement ICMPv6 packet, and then ignores the "other-config" flag, and is left without any way to resolve DNS records.

Making the network... you know... useless.

For the record, how Android is presumed to work is by utilizing a later addition to the ICMPv6 router advertisement format, RFC6106, which added a Recursive DNS Server (RDNSS) option to the router advertisement to allow DNS information to be included in the original RA broadcast along with the local subnet and default gateway addressing information. Unfortunately, since this addition to ICMPv6 was made about fifteen years late, RDNSS options aren't supported by the version of IOS I'm running on my Cisco 6506, so it would seem I'm pretty shit out of luck when it comes to running Android on the IPv6-only subnets of my network.

My (Really Not) Beautiful Solution


So we've got a router that doesn't support the RDNSS option in its router advertisements since it predates the concept of RA RDNSS, and we have one of the most popular consumer operating systems which refuses to support DHCPv6, leaving us at an impasse for configuring DNS servers. I actually spent a few weeks thinking about this one, including slowly digging deeper and deeper into the relevant protocols, before I eventually found myself reading the raw specifications for the ICMPv6 router advertisement packets (kill me) and realized that a router can broadcast Router Advertisement packets while indicating in the RA packet that it shouldn't be used for routing.
So here's my solution, which admittedly even I think feels a little dirty, but an ugly solution that works... still works.

  • My Cisco 6506 sends out Router Advertisements specifying the local subnet, and that it should be used as a default gateway.
  • I then spun up a small Linux VM on the same subnet running radvd, which advertises that this VM shouldn't be used as a default gateway, but does advertise an RDNSS option pointing to my DNS64 resolver as the DNS server to use, since radvd supports RDNSS.
  • Any normal functional IPv6 stack will receive the Cisco's RA packet, note the "other-config" flag, and send a DHCPv6 query to receive the address for a DNS server.
  • Android receives the Cisco's RA, configures its address and gateway, but ignores the "other-config" flag. The phone will then receive a second RA packet from the Linux server running radvd, which includes the RDNSS option which I'm not able to configure on my router, and Android will (hopefully) merge the two RA configurations together to collectively generate a functional subnet, gateway, and DNS server configuration for the network.
Now let us be very clear; while I was setting this up, I was very confident that this was not going to work. Expecting Android to receive two different RAs from two different hosts with only parts of the information it needed, and combining those together, seems like an insane solution to this problem.

The bad news is that this actually worked.

So we spin up a VM, install radvd on it, and assign it one network interface on my WiFi vlan. The /etc/radvd.conf file is relatively short:
kenneth@kwfradvd:/etc$ cat radvd.conf
interface ens160 {
    AdvSendAdvert on;
    IgnoreIfMissing on;
    AdvDefaultLifetime 0;
    RDNSS 2001:4860:4860::6464 {};
    RDNSS 2001:4860:4860::64 {};
};

The network interface happens to be ens160, the "AdvDefaultLifetime 0;" parameter indicates that this router shouldn't be used as a default gateway, and the two RDNSS parameters specify which IP addresses to use for DNS resolution. Here I show it with both of Google's DNS64 servers, but in reality I'm running my own local DNS64 server, because I'm directly peered with two root DNS servers and b.gtld, so why not run my own resolver?

I still feel a little dirty that this works, but it does! I then installed an access point in each of my data center cabinets on this vlan advertising the SSID "_Peer@FCIX" so we're both advertising my little Internet Exchange project and I've got nice fast WiFi right next to my cabinet.

Using Catalog Zones in BIND to Configure Slave DNS Servers

20 September 2019 at 04:00
I was recently asked to help one of my friends think about re-architecting his anycast DNS service, so I've been thinking about DNS a lot on the backburner for the last few months. As part of that, I was idly reading through the whole ISC knowledge base this week, since they've got a lot of really good documentation on BIND and DNS, and I stumbled across a short article talking about catalog zones.

Catalog zones are this new concept (well, kind of new; there's speculative papers about the concept going back about a decade) where you use DNS and the existing AXFR transfer method to move configuration parameters from a master DNS server to its slaves.

In the context of the anycast DNS service I'm working on now, this solves a pretty major issue which is the question on how to push new configuration changes to all of the nodes. This DNS service is a pretty typical secondary authoritative DNS service with a hidden master. This means that our users are expected to run their own DNS servers serving their zones, and we then have one hidden master which transfers zones from all these various users' servers, and then redistribute the zones to the handful of identical slave servers distributed worldwide which all listen on the same anycast prefix addresses.
Updating existing zones is a standard part of the DNS protocol, where the customer updates their zone locally, and when they increment their SOA serial number, their server sends a notify to our hidden master, which initiates a zone transfer to get the latest version of their zone, and then sends a notify to the pool of anycast slaves to all update their copy of the customer zone from the hidden master. Thanks to the notify feature in DNS, pushing an updated zone to all of these slave authoritative servers happens pretty quickly, so the rest of the Internet sending queries to the anycast node nearest to them starts seeing the new zone updates right away.

The problem is when you want to add new zones to this whole pipeline. After our customer has created the new zone on their server and allowed transfers from our master, we need to configure our hidden master as a slave for that zone pointed to the customer server, and we need to configure all of the anycast nodes to be a slave for the zone pointed back at our hidden master. If we miss one, you start experiencing very hard to troubleshoot issues where it seems like we aren't correctly being authoritative for the new zone, but only in certain regions of the Internet depending on which specific anycast node you're hitting. Anycast systems really depend on all the instances seeming identical, so it doesn't matter which one you get routed to.

There are, of course, hundreds of different ways to automate the provisioning of these new zones on all of the anycast nodes, so this isn't an unsolved issue, but the possible solutions range anywhere from a bash for loop calling rsync and ssh on each node to using provisioning frameworks like Ansible to reprovision all the nodes any time the set of customer zones changes.

Catalog zones is a clever way to move this issue of configuring a slave DNS server for what zones it should be authoritative for into the DNS protocol itself, by having the slaves transfer a specially formatted zone from the master which lists PTR records for each of the other zones to be authoritative for. This means that adding a new zone to the slaves no longer involves changing any of the BIND configuration files on the slave nodes and reloading, but instead is a DNS notify from the master that the catalog zone has changed, an AXFR of the catalog zone, and then parsing this zone to configure all of the zones to also transfer from the master. DNS is already a really good protocol for moving dynamic lists of records around using the notify/AXFR/IXFR mechanism, so using it to also manage the dynamic list of zones to do this for is in my opinion genius.

Labbing It at Home


So after reading the ISC article on catalog zones, and also finding an article by Jan-Piet on the matter, I decided to spin up two virtual machines and have a go at using this new feature available in BIND 9.11.

A couple things to note before getting into it:

  • Catalog zones are a pretty non-standard feature which is currently only supported by BIND. There's a draft RFC on catalog zones, which has already moved past version 1 supported in BIND, so changes are likely for this feature in the future.
  • Both of the tutorials I read happened to use BIND for the master serving the catalog zone, and used rndc to dynamically add new zones to the running server, but this isn't required. Particularly since we're using a hidden master configuration, there's no downside to generating the catalog zone and corresponding zone configurations on the master using any provisioning system you like, and simply restarting or reloading that name server to pick up the changes and distribute them to the slaves, since the hidden master is only acting as a relay to collect all the zones from various customers and serve as a single place for all the anycast slaves to transfer zones from.
  • This catalog zone feature doesn't even depend on the master server running BIND. As far as the master is concerned, the catalog zone is just another DNS zone, which it serves just like any other zone. It's only the slaves which receive the catalog zone which need to be able to parse the catalog to dynamically add other zones based on what they receive.
We want to keep this exercise as simple as possible, so we're not doing anything involving anycast, hidden masters, or adding zones to running servers. We're only spinning up two servers, in this case both running Ubuntu 18.04, but any distro which includes BIND 9.11 should work:
  • ns1.lan.thelifeofkenneth.com (10.44.1.228) - A standard authoritative server serving the zones "catalog.ns1.lan.thelifeofkenneth.com", "zone1.example.com", and "zone2.example.com". This server is acting as our source of zone transfers, so there's nothing special going on here except sending notify messages and zone transfers to our slave DNS server.
  • ns2.lan.thelifeofkenneth.com (10.44.1.234) - A slave DNS server running BIND 9.11 and configured to be a slave to ns1 for the zone "catalog.ns1.lan.thelifeofkenneth.com" and to use this zone as a catalog zone with ns1 (10.44.1.228) as the default master. Via this catalog zone, ns2 will add "zone1.example.com" and "zone2.example.com" and transfer those zones from ns1.
We first want to set up ns1, which is a normal authoritative DNS configuration, with the one addition that I added logging for transfers, since that's what we're playing with here.

Nothing too unexpected there; turn off recursion service, and turn on logging.
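
As a rough sketch of what that amounts to in named.conf.options (the paths are the Debian defaults, and the channel name is just what I'd pick; the log file matches the /var/cache/bind/zone_transfers path we'll watch later):

options {
        directory "/var/cache/bind";
        recursion no;
};

logging {
        channel zone_transfers {
                file "/var/cache/bind/zone_transfers";
                print-time yes;
                severity info;
        };
        category xfer-in  { zone_transfers; };
        category xfer-out { zone_transfers; };
        category notify   { zone_transfers; };
};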

The only particularly unusual thing about the definition of the zone files is that I'm explicitly listing the IP address for the slave server under "also-notify". I'm doing that here because I couldn't get it to work based on the NS records like I think it should, but that might also be because I'm using zones not actually delegated to me.
In my actual application, I'll need to use also-notify anyways, because I need to send notify messages to every anycast node instance on their unicast addresses. In a real application I would also lock down the zone transfers to only allow my slaves to transfer zones from the master, since it's generally bad practice to allow anyone to download your whole zone file.
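
A sketch of what one of those zone definitions looks like on ns1 (the file path is just where I'd keep it; 10.44.1.234 is ns2's address from above):

zone "zone1.example.com" {
        type master;
        file "/etc/bind/db.zone1.example.com";
        also-notify { 10.44.1.234; };
        # In production: allow-transfer { 10.44.1.234; }; instead of leaving transfers open
};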

The two example.com zone files are also pretty unremarkable.

Up until this point, you haven't actually seen anything relevant to the catalog zone, so this is where you should start paying attention! The last file on ns1 of importance is the catalog file itself, which we'll dig into next:
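Roughly, the zone data looks like this (the SOA values here are illustrative, and the second zone's hash is a placeholder to be generated the same way as the first one):

@        3600 IN SOA ns1.lan.thelifeofkenneth.com. hostmaster.lan.thelifeofkenneth.com. ( 1 3600 600 86400 3600 )

@        3600 IN NS  ns1.lan.thelifeofkenneth.com.

version  3600 IN TXT "1"

ddb8c2c4b7c59a9a3344cc034ccb8637f89ff997.zones 3600 IN PTR zone1.example.com.
HASH-OF-ZONE2-GOES-HERE.zones                  3600 IN PTR zone2.example.com. ; placeholder hash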

Ok, so that might look pretty hairy, so let us step through that line by line.
  • Line 1: A standard start of authority record for the zone. A lot of the examples use dummy zones like "catalog.meta" or "catalog.example" for the catalog zone, but I don't like trying to come up with fake TLDs which I just have to hope isn't going to become a valid TLD later, so I named my catalog zone under my name server's hostname. In reality, the name of this zone does not matter, because no one should ever be sending queries against it; it's just a zone to be transferred to the slaves and processed there.
  • Line 3: Every zone needs an NS record, which again can be made a dummy record if you'd like, because no one should ever be querying this zone.
  • Line 5: To tell the slaves to parse this catalog zone as a version 1 catalog format, we create a TXT record for "version" with a value of "1". It's important to remember the importance of a trailing dot on record names! Since "version" doesn't end in a dot, the zone is implicitly appended to it, so you could also define this record as "version.catalog.ns1.lan.thelifeofkenneth.com." but that's a lot of typing to be repeated, so we don't.
  • Lines 7 and 8: This is where the actual magic happens, by defining unique PTR records with values for each of the zones which this catalog file is listing for the slaves to be authoritative for. This is somewhat of an abuse of the PTR record meaning, but adding new record types has proven impractical, so here we are. Each record is a [unique identifier].zones.catalog.... etc.
The one trick with the version 1 catalog zone that's implemented by BIND is that the value of the unique identifier per cataloged zone is pretty specific. It is the hexadecimal representation of the SHA1 sum of the on-the-wire format of the cataloged zone's name.

I've thought about it quite a bit, and while I can see some advantages to using a stable unique identifier like this per PTR record, I don't grasp why BIND should strictly require it, and reading the version 2 spec in the draft RFC, it looks like they might loosen this up in the future, but for now we need to generate the exact hostname expected for each zone. I did this by adding a python script to my local system based on Jan-Piet's code:
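
A minimal sketch of that script, using dnspython's dns.name to get the wire-format name and hashlib for the digest, looks like this:

#!/usr/bin/env python3
# dns-catalog-hash: print the catalog zone member label for a given zone name.
# The label is the hex SHA1 digest of the zone name in DNS wire format.
import hashlib
import sys

import dns.name  # from the dnspython package

zone = dns.name.from_text(sys.argv[1])           # e.g. "zone1.example.com"
print(hashlib.sha1(zone.to_wire()).hexdigest())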

I needed to install the dnspython package (pip3 install dnspython), but I could then use this python script to calculate the hash for each zone, and add it to my catalog zone by appending ".zones" to it and adding it as a PTR record with the value of the zone itself. So looking back at line 7 of the catalog zone file, by running "dns-catalog-hash zone1.example.com" the python script spit out the hash "ddb8c2c4b7c59a9a3344cc034ccb8637f89ff997" which is why I used that for the record name.

Now before we talk about the slave name server, I want to again emphasize that we haven't utilized any unusual features yet. NS1 is just a normal DNS server serving normal DNS zones, so generate the catalog zone file any way you like, and ns1 can be running any DNS daemon which you like. Adding each new zone to ns1 involves adding it to the daemon config like usual, and the only additional step is also adding it as a PTR record to the catalog zone.

On to ns2! This is where things start to get exciting, because what I show you here will be the only change ever needed on ns2 to continue to serve any additional zones we like based on the catalog zone.

Again, we've turned off recursion, and turned on transfer logging to help us see what's going on, but the important addition to the BIND options config is the addition of the catalog-zones directive. This tells the slave to parse the named zone as a catalog zone. We do explicitly tell it to assume the master for each new zone should be 10.44.1.228, but the catalog zone format actually supports you explicitly defining per zone configuration directives like masters, etc. So just appreciate that we're using the bare minimum of the catalog zone feature here by just adding new zones to transfer from the default master.

This is the totally cool part about catalog zones right here; our local zones config file just tells the slave where to get the catalog from, and BIND takes it from there based on what it gets from the catalog.
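
Putting those two pieces together, the ns2 side boils down to roughly this sketch (10.44.1.228 being ns1, and the file name being whatever you want the cached catalog saved as):

// In named.conf.options:
options {
        directory "/var/cache/bind";
        recursion no;
        catalog-zones {
                zone "catalog.ns1.lan.thelifeofkenneth.com" default-masters { 10.44.1.228; };
        };
};

// In named.conf.local, the only zone we ever have to define by hand:
zone "catalog.ns1.lan.thelifeofkenneth.com" {
        type slave;
        file "catalog.ns1.lan.thelifeofkenneth.com";
        masters { 10.44.1.228; };
};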

If you fire up both of these daemons, with IP addresses and domain names changed to suit your lab environment, and watch the /var/cache/bind/zone_transfers log files, you should see:
  • ns1 fires off notify messages for all the zones
  • ns2 starts a transfer of catalog.ns1.lan.thelifeofkenneth.com and processes it
  • Based on that catalog file ns2 starts additional transfers for zone1.example.com and zone2.example.com
  • Both ns1 and ns2 are now authoritative for zone[12].example.com!

To verify that ns2 is being authoritative like it should be, you can send it queries like "dig txt test.zone1.example.com @ns2.lan.thelifeofkenneth.com" and get the appropriate answer back. You can also look in the ns2:/var/cache/bind/ directory to confirm that it has local caches for the catalog and example.com zones:
The catalog file is cached in whatever filename you set for it in the named.conf.local file, but we never told it what filenames to use for each of the cataloged zones, so BIND came up with its own filenames for zone[12].example.com starting with "__catz__" and based on the catalog zone's name and each zone's name itself.

Final thoughts


I find this whole catalog zone concept really appealing since it's such an elegant solution to exactly the problem I've been pondering for quite a while.

It's important to note that this set of example configs aren't production worthy, since this was just me in a sandbox over two evenings. A couple problems off the top of my head:
  • You should be locking down zone transfers so only the slaves can AXFR the catalog zone and all of your other zones, since otherwise someone could enumerate all of your zones and all the hosts on those zones.
  • You probably should disallow even any queries against the catalog zone. I didn't, since it made debugging the zone transfers easier, but I can see the downside to answering queries out of the catalog. It wouldn't help enumerate zones, since it'd be easier to guess zone names and query for their SOAs than guessing the SHA1 sums of the same zone names and asking for the PTR record for it out of the catalog, but if you start using more sophisticated features of the catalog zone like defining per-zone masters or other configuration parameters, you might not want to allow those to be available for query by the public.

Making a Walnut Guest WiFi Coaster

19 September 2019 at 17:00

I was recently reading about the 13.56MHz NFC protocol and the standard tags you can write and read from your phone, when I realized that one of the features of NFC is that you can program tags with WiFi credentials, via the concept of NDEF records, which let you encode URLs, vCards, plain text, etc.

I thought this would be a good gift idea, so I bought some NFC tags on eBay, and then built a coaster around it using walnut and sand blasted glass.



The main thing for building one of these is getting some NFC tags, which you can easily find on Amazon, and writing your WiFi credentials to one as an NDEF record, which is possible using various phone apps, including the TagWriter app from NXP, which is surprisingly good for being a vendor app.

Adding Webseed URLs to Torrent Files

5 September 2019 at 19:00
I was recently hanging out on a Slack discussing the deficiencies in the BitTorrent protocol for fast file distribution. A decade ago when Linux mirrors tended to melt down on release day, Bittorrent was seen as a boon for being able to distribute the relatively large ISO files to everyone trying to get it, and the peer-to-peer nature of the protocol meant that the swarm tended to scale with the popularity of the torrent, kind of by definition.

There were a few important points raised during this discussion (helped by the fact that one of the participants had actually presented a paper on the topic):

  1. HTTP-based content distribution networks have gotten VASTLY better in the last decade, so you tend not to see servers hugged to death anymore when the admins are expecting a lot of traffic.
  2. Users tend to see slower downloads from the Bittorrent swarm than they do from single healthy HTTP servers, with a very wide deviation as a function of the countless knobs exposed to the user in Bittorrent clients.
  3. Maintaining Bittorrent seedbox infrastructure in addition to the existing HTTP infrastructure is additional administrative overhead for the content creators, which tends to not be leveraged as well as the HTTP infrastructure for several reasons, including Bittorrent's hesitancy to really scale up traffic, its far from optimal access patterns across storage, the plethora of abstract knobs which seem to have a large impact on the utilization of seedboxes, etc.
  4. The torrent trackers are still a central point of failure for distribution, and now the content creator is having to deal with a ton of requests against a stateful database instead of just serving read-only files from a cluster of HTTP servers which can trivially scale horizontally.
  5. Torrent files are often treated as second class citizens since they aren't as user-friendly as an HTTP link, and may only be generated as part of releases to quiet the "hippies" who still think that Bittorrent is relevant in the age of big gun CDNs.
  6. Torrent availability might be poor at the beginning and end of a torrent's life cycle, since seedboxes tend to limit how many torrents they're actively seeding. When a Linux distro drops fifteen different spins of their release, their seedbox will tend to only seed a few of them at a time and you'll see completely dead torrents several hours if not days into the release cycle.
As any good nerd discussion on Slack goes, we started digging into the finer details of the Bittorrent specification like the Distributed Hash Table that helped reduce the dependence on the central tracker, peer selection algorithms and their tradeoffs, and finally the concept of webseed.

Webseed is a pretty interesting concept which was a late addition to Bittorrent where you could include URLs to HTTP servers serving the torrent contents, to hopefully give you most of the benefits of both protocols; the modern bandwidth scalability of HTTP, and the distributed fault tolerance and inherent scaling of Bittorrent as a function of popularity.

I was aware of webseed, but haven't seen it actually used in years, so I decided to dig into it and see what I could learn about it and how it fits into the torrent file structure.

The torrent file, which is the small description database which you use to start downloading all of the actual content of a torrent, at the very least contains a list of the files in the torrent and checksums for each of the fixed-size chunks making up those files. Of course, instead of using a popular object serializer like XML or JSON (which I appreciate might not have really been as popular at the inception of Bittorrent), the torrent file uses a format I've never seen anywhere else called BEncoding.

The BEncoding format is relatively simple: values can be byte strings, integers, lists, or dictionaries, and lists and dictionaries can in turn contain further byte strings, integers, or other lists/dictionaries. Bittorrent uses this format to store a dictionary named "info", which contains the file names and chunk hashes that define the identity of a torrent swarm; beyond that one dictionary, you can modify anything else in the file without changing the identity of the swarm, including the "announce" byte string or "announce-list" list of byte strings pointing at trackers, comments, creation dates, etc.

Fortunately, the BEncoding format is relatively human readable, since length fields are encoded as ASCII integers and field delimiters are characters like ':', 'l', and 'i'. Unfortunately, it's all encoded as a single line with no breaks, so trying to edit the file by hand with a text editor can get a little hairy.
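
To make this concrete, here's a toy encoder, a minimal sketch in Python written purely for illustration (it isn't taken from any real Bittorrent tooling), which emits the wire format described above for a small made-up dictionary:

    # Toy bencoder: byte strings are <length>:<bytes>, integers are i<n>e,
    # lists are l...e, and dictionaries are d...e with keys sorted as raw bytes.
    def bencode(value):
        if isinstance(value, bytes):
            return str(len(value)).encode() + b":" + value
        if isinstance(value, str):
            return bencode(value.encode())
        if isinstance(value, int):
            return b"i" + str(value).encode() + b"e"
        if isinstance(value, list):
            return b"l" + b"".join(bencode(v) for v in value) + b"e"
        if isinstance(value, dict):
            items = sorted((k.encode() if isinstance(k, str) else k, v) for k, v in value.items())
            return b"d" + b"".join(bencode(k) + bencode(v) for k, v in items) + b"e"
        raise TypeError("can't bencode " + str(type(value)))

    print(bencode({"announce": "http://tracker.example/announce", "creation date": 1650000000}))
    # b'd8:announce31:http://tracker.example/announce13:creation datei1650000000ee'

The length prefixes are what let the format get away without any escaping, and also why a single stray character from hand editing corrupts everything that follows it.
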
I wasn't able to find a tremendous amount of tooling for interactively editing BEncode files; there exist a few online "torrent editors" which give you basic access to some of the fields which aren't part of the info dictionary, but none of them seemed to offer the arbitrary key-value editing I needed to play with webseed, so I settled on a Windows tool called BEncode Editor. The nice thing about this tool is that it's designed as a general-purpose BEncode editor rather than specifically a torrent editor, so it has that authentic "no training wheels included" hacker feel to it. User beware.

As an example, I grabbed the torrent file for the eXoDOS v4 collection, a huge collection of 7000 DOS games bundled with various builds of DOSBox to make it all work on a modern system. Opening the torrent file in BEncode Editor, you can see the main info dictionary at the end of the root dictionary, which is the part you don't want to touch since it defines the identity of the torrent. In addition to that, the root dictionary contains five other elements: a 43 byte byte string named "announce", which is the URI of the primary tracker to use to announce yourself to the rest of the swarm; a list of 20 elements named "announce-list", which holds alternative trackers (the file likely contains both the single tracker and the list of trackers for backwards compatibility with Bittorrent clients which predate the concept of announce-lists?); byte strings labeled "comment" and "created by"; and an integer named "creation date", which looks like a Unix timestamp.
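
If you'd rather poke at the same structure from a script instead of a GUI, something like the sketch below works. This assumes the third-party bencode.py package (imported as bencodepy) and a locally downloaded copy of the torrent; the file name is just a placeholder for whatever you're inspecting:

    import bencodepy  # third-party package, assumed installed via pip install bencode.py

    with open("eXoDOS.torrent", "rb") as f:
        torrent = bencodepy.decode(f.read())  # top-level dictionary, keys come back as bytes

    # Everything outside of 'info' can be edited without changing the
    # infohash that identifies the swarm.
    for key, value in torrent.items():
        if key == b"info":
            print("info: (defines the swarm identity, don't touch)")
        elif isinstance(value, list):
            print(key.decode(), ": list of", len(value), "elements")
        else:
            print(key.decode(), ":", value)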

Cool! So at this point, we have an interactive tool to inspect and modify a BEncode database, and know which parts to not touch to avoid breaking things (The "info" dictionary).

Now back to the original point: adding webseed URLs to a torrent file.

Webseeding is defined in Bittorrent specification BEP_0019, which I didn't find particularly clear, but the main takeaway for me is that to enable webseeding, I just need to add a list to the torrent named "url-list", and then add byte-string elements to that list which are URLs to HTTP/FTP servers serving the same contents.

So the first step is to log into one of my web servers, download the torrent, and throw the contents in an open directory (in my case, https://mirror.thelifeofkenneth.com/lib/). For actual content creators, HTTP hosting of the content should already be part of their normal release workflow, so this step is only really needed when you're retrofitting webseed onto an existing torrent.
Now we start editing the torrent file by adding a "url-list" list to the root dictionary. The part I found a little tricky was figuring out how to add the byte-string child to the list; in BEncode Editor you click on the empty "url-list" list, click "add", and specify that the new element should be added as a "child" of the current element.
Referring back to BEP_0019, if I end the URL with a forward slash, the client should append info['name'] to it, so the byte string I'm adding as a child of the list is "https://mirror.thelifeofkenneth.com/lib/", such that the client will append "eXoDOS" and look for the content at "https://mirror.thelifeofkenneth.com/lib/eXoDOS/", which is correct.
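
If you'd rather script this step than click through BEncode Editor, the same bencodepy-based approach from earlier extends naturally; again, the package and the file names are my assumptions for the sketch, not part of any official tooling:

    import bencodepy  # third-party package, assumed installed via pip install bencode.py

    with open("eXoDOS.torrent", "rb") as f:
        torrent = bencodepy.decode(f.read())

    # Add the webseed list next to (not inside) the info dictionary, so the
    # infohash, and therefore the swarm identity, stays exactly the same.
    webseeds = torrent.setdefault(b"url-list", [])
    # Trailing slash means the client appends info['name'] to the URL per BEP_0019.
    webseeds.append(b"https://mirror.thelifeofkenneth.com/lib/")

    with open("eXoDOS-webseed.torrent", "wb") as f:
        f.write(bencodepy.encode(torrent))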

Save the result as a new .torrent file, and success! Now I have a version of the eXoDOS torrent with the swarm's performance supplemented by my own HTTP server. The same could be done for any other torrent where the exact same content is available via HTTP, and honestly I'm a little surprised I don't see Linux distros using this more, since it removes the need for them to commit to maintaining dedicated torrent infrastructure; the swarm can at least survive off of an HTTP server, which the content creator is clearly already running.

Building Your Own Bluetooth Speaker

22 July 2019 at 17:30
Video: (embedded in the original post)

I recently found a nice looking unpowered speaker at a thrift shop, so I decided to turn it into a Bluetooth speaker that I could use in my apartment with my phone. The parts list is pretty short:

Using 0603 Surface Mount Components for Prototyping

5 June 2019 at 18:00
As a quick little tip: when I'm prototyping circuits on 0.1" perf board, I like using 0603 surface mount components for all of my passives and LEDs, since they fit nicely between the pads. This way I don't need to bother with any wiring between LEDs and their resistors; I can just use three pads in a row to mount the LED and its current limiting resistor, and only need to wire the two ends to wherever they're going.
I also like using it as a way to put 0.1uF capacitors between adjacent pins on headers, so filtering pins doesn't take any additional space or wiring.

The fact that this works makes sense: "0603" means the surface mount chips are 06 hundredths of an inch by 03 hundredths of an inch, so they're 0.06" long, which fits nicely between pads on a 0.1" grid.

I definitely don't regret buying an 0603 SMT passives book kit, which covers most of the mix of resistors and capacitors I need. I then buy a spool of anything that I manage to fully use up since it's obviously popular, so I eventually bought a spool of 0.1uF capacitors and 330 ohm resistors to restock my book when those strips ran out.
