# meshcoresnr
This is a simple program to parse signal metrics from received packets on a MeshCore device and serve them as a Prometheus endpoint for further processing.
It accepts input either directly from the device's serial port debug output, or from an MQTT server fed by Cisien's fork of meshcoretomqtt.
## Building

A fairly straightforward golang program with few dependencies. If compiling to run standalone, you can run (from the base dir):

```
cd meshcoresnr
go build
```
If you'd prefer a Docker image:

```
docker build -t meshcoresnr:latest .
```
Using this with a MeshCore repeater requires you to compile custom firmware with this flag uncommented:

```
-D MESH_PACKET_LOGGING=1
```
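For orientation: the MeshCore firmware is a PlatformIO project, so the flag goes in the `build_flags` of whichever `platformio.ini` environment you build. A sketch, with a made-up environment name:

```ini
; find your board's environment in the firmware's platformio.ini (name here is made up)
[env:my_repeater_board]
build_flags = -D MESH_PACKET_LOGGING=1
```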
This is the same requirement as meshcoretomqtt, since it prints out SNR, RSSI, and RAW packet info that we need to track signal strength. I believe it only emits this data with a repeater build at the moment.
## Running
There's only one required flag: `-config [config.yaml filename]`

This file should contain all the relevant configuration for your chosen mode. For MQTT, that's mostly the MQTT host to connect to and the username/password; for serial, it's just the serial port. Both modes also share some configuration: the prometheus port to run /metrics on, and an optional regex to filter to only certain origin IDs. There's an example yaml with some comments at config-example.yaml.
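To give a sense of the shape, a hypothetical sketch (most key names here are guesses; config-example.yaml is the source of truth):

```yaml
# Hypothetical sketch -- check config-example.yaml for the real key names.
Mode: mqtt                              # or "serial" (guessed key)
MQTTHost: tcp://mqtt.example.com:1883   # guessed keys for MQTT mode
MQTTUsername: meshcore
MQTTPassword: changeme
SerialPort: /dev/ttyACM0                # used in serial mode instead
PrometheusPort: 9100                    # where /metrics is served
OriginFilter: ""                        # optional regex; empty records everything
```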
If you're running this bare metal, `./meshcoresnr -config config.yaml` will work if you have everything defined.
If you're running under docker, you will want to map the config.yaml file to /conf/config.yaml, and forward the appropriate port to provide external access. I'm also not publishing to a registry, so you'll want to build and tag the image locally. I included a sample docker-compose.yaml file, but you'll want to tweak it for your own needs (device names and ports and whatnot).
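Something along these lines (a sketch, not the shipped file; the port and device are placeholders to adjust):

```yaml
# Hypothetical docker-compose.yaml sketch -- see the sample file in the repo.
services:
  meshcoresnr:
    image: meshcoresnr:latest             # built and tagged locally, per Building above
    restart: unless-stopped               # it exits on connection errors; let docker restart it
    volumes:
      - ./config.yaml:/conf/config.yaml:ro
    ports:
      - "9100:9100"                       # whatever prometheus port your config.yaml uses
    devices:
      - /dev/ttyACM0:/dev/ttyACM0         # only needed for serial mode
```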
To get more debug output, add `-v #` (with `#` being between 1 and 5) for increasing amounts of garbage that I thought might be helpful.
A quick note that it will simply exit upon either the MQTT connection being lost or errors reading serial data. This relies upon docker (or whatever service you've set up) restarting the program and doing backoff over time, so keep that in mind.
## Metrics
There are various metrics emitted, all under the `meshcore_` prefix, and all with `origin_id`, `origin_short`, and `node` labels.

- `meshcore_rx_count` - a counter of all messages received, bucketed by origin node. Ticks up by 1 for each received packet.
- `meshcore_tx_count` - same, but for transmitted messages.
- `meshcore_node_rx_count` - a counter of messages received, bucketed by origin node and remote node.
- `meshcore_route_type_rx_count`, `meshcore_payload_type_rx_count` - (guess!)
- `meshcore_node_snr` - the average SNR of the last 15 minutes of packets heard from a node.
- `meshcore_node_rssi` - the average RSSI of the last 15 minutes of packets heard from a node.
- `meshcore_node_airtime`, `meshcore_payload_type_airtime` - a float representing the percentage of airtime taken up by either each node or each payload type, averaged over the last 15 minutes.
If a remote node isn't heard from in 15 minutes, the `node_snr` and `node_rssi` metrics are removed for that node (so that you don't see a flat line going on forever if a repeater has gone offline or isn't reaching you).
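So a scrape of /metrics might contain series shaped roughly like this (every ID and value below is made up):

```
meshcore_node_snr{origin_id="a1b2c3d4",origin_short="a1b2",node="11"} 8.25
meshcore_node_rssi{origin_id="a1b2c3d4",origin_short="a1b2",node="11"} -97.5
meshcore_node_rx_count{origin_id="a1b2c3d4",origin_short="a1b2",node="11"} 42
```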
## Questions I pretend are frequently asked but actually I made up in my head
### How is this useful, why did you make it?
When I first started running MeshCore, I did a lot of looking through packet logs on my repeater while making adjustments, just trying to get a connection. Debug output like SNR helped a lot, but it would go up and down, and you really had to look through a lot of packets to figure out if something was getting better or worse after a change; it was all pretty hand-wavey. Prometheus metrics are great when you expect to see changes over time. As a simple example: repositioning an antenna and then coming back an hour later to see what received SNR has done for all your neighbors. Similarly, I've noticed changes in propagation over time, which allow me to say things like "this neighbor is sitting at 10 SNR now, sure, but I've also seen it change to -5 SNR for multiple days, so I probably can't depend on it".
### What is Prometheus and how do I set that up?
Probably beyond the scope of this README; it's not necessarily hard to configure, but it might be frustrating to go from 0 to 100 just to get some MeshCore metrics out the other end (it has a lot of subcomponents and yaml configuration). This program is more intended for those who, like me, already have a Prometheus instance running and want to add MC data to it. You're absolutely welcome to take the code and incorporate it into whatever other timeseries metric system you have lying around, though!
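That said, if you do already run Prometheus, the wiring on that side is just one scrape job. A minimal sketch (job name and port are placeholders):

```yaml
# in your prometheus.yml
scrape_configs:
  - job_name: meshcoresnr
    static_configs:
      - targets: ["localhost:9100"]   # host:port where this program serves /metrics
```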
### Serial or MQTT? Which should I choose??
I started this intending to only support MQTT - the meshcoretomqtt script does a fine job of shipping debug output off-device and that data can be used for all sorts of things at once with multiple consumers (e.g. map.w0z.is). But frankly, asking people to incorporate this into Prometheus and also spin up an MQTT server just for one device seemed a bit much. So the serial option is for if you don't have an MQTT server and don't want one. It'll only support one origin ID (the serial device you connect it to) but it's more self-contained. If you're already using MQTT or you want support for multiple repeater metrics at once - use the MQTT configuration!
### OriginFilter?? What is this nonsense?
Yeah sorry, my naming is not always the best. This is a regex field that filters based upon the repeater that received the packet. For serial, there's not much use - you are only ever connected to the same repeater so you probably do want those metrics. For MQTT though, you can be connected to a server that receives metrics for many many repeaters, and if one of them were up on a mountain, where it could hear a whole bunch of remote nodes... well, you might not want to deal with the cardinality for all those extra metrics. Most people won't need or want it, but if you do, the easiest thing would be just putting your repeater's pubkey in as a string, all hex lowercase. If you don't include it, all metrics are recorded.**
** As a quick note, this does mean that if you're not interested in compiling your own firmware, you could technically use this software to report metrics for other repeaters that send data to an MQTT server.
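Concretely, the filter might look like this in config.yaml (key casing per config-example.yaml; the pubkey is fake):

```yaml
OriginFilter: "11aabbccddeeff00"   # your repeater's pubkey, lowercase hex; any regex works
```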
### How does the averaging of metrics work (in more detail)?
All metrics are now aggregated for a certain ExpireDuration (15 minutes by default, but you can define it in the configuration!) and expired once they are outside of that window. Shorter durations will make your metrics bounce around a lot more but be more quickly responsive to changes. Longer durations give you that smoooooooooooothed out vibe.
What about when you first add a metric? When you have less data, you can either average across the little that exists (in the extreme, 1 packet, or a couple seconds for airtime), which causes a lot of jitter, or not emit metrics at all. For this reason I include SuppressDuration in the configuration too. This means "when first recording a measurement, do not emit prometheus metrics for SuppressDuration time, but DO store them for aggregation". The default SuppressDuration is 90 seconds, so in that case you will have a buffer of 90 seconds of data before Prometheus starts reporting it out. If you don't like this, you can set it to 0s to get XTREME JITTER (or set it to 15m to emit only when you have the full buffer of data).
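In config.yaml terms (names as used in this README; defaults shown):

```yaml
ExpireDuration: 15m     # aggregation window; data outside it is expired
SuppressDuration: 90s   # buffer this long before a new series is emitted (0s for XTREME JITTER)
```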
## Some neat things you could do with the metrics
Coming up with a dashboard is its own unique challenge. I figured I'd share a couple simple queries here to get you started, and then hopefully you can leave me in the dust with your glorious visualizations:
```
avg by (node) (meshcore_node_snr{origin_short="$origin_short"})
```
You may or may not care about the `origin_short` bit; if you only have one origin ID it won't matter. But if you're connected to a bunch of repeaters through MQTT, this will show a view of remote node SNR for that one repeater. Grafana (and presumably others) can make this a variable for filtering on the fly. This is the main interesting thing to watch for "how is signal?"
```
avg by (node) (rate(meshcore_node_rx_count{origin_short="$origin_short"}[60m])*60) > 0
```
Per minute rate of RX by node. A good way to track how many packets you're getting from your remote repeaters, so regardless of the signal strength you can actually compare the packets being successfully decoded. Phrased another way, if you're getting 1 packet per minute from one node and 0.5 ppm from another, you might assume that you're missing half of the flood traffic from the latter. This isn't precisely 100% accurate, but we're only interested in flood routes so it seems to hold up in practice.
I haven't done this yet, but there's nothing that would prevent you from doing a two way graph if you have two repeaters that can see each other and are reporting that data to MQTT. Match origin_short=src, node=dst one way, and origin_short=dst, node=src the other. Then you could see the RX SNR from both perspectives at once.
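For example, something like this pair of queries, where `repeater_a`/`repeater_b` stand in for your two origin IDs and `"bb"`/`"aa"` for each other's first pubkey byte (all placeholders):

```
avg by (node) (meshcore_node_snr{origin_short="repeater_a", node="bb"})
avg by (node) (meshcore_node_snr{origin_short="repeater_b", node="aa"})
```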
Stack up airtime, either by node (N/A is used for direct packets in that case, to keep the same total) or by payload type. This can help you determine what your repeater might be seeing for local collisions, changes over time, and the relative mix of traffic types. Unlike the repeater statistics, this will be averaged across the ExpireDuration rather than being the total since the repeater was booted. For a better view of the relative portions of traffic, try something like this:
```
(sum(meshcore_payload_type_airtime{origin_short="$origin_short"}) by (payload_type) / on() group_left() sum(meshcore_payload_type_airtime{origin_short="$origin_short"}))
```
This normalizes the total to 1.0, so you can see how the percentage of traffic changes for each type.
## One caveat about pub keys
Ending on a downer! If you look at the packet structure (or see your path hops in the phone app), you know that paths are only tracked by the first byte of a pub key, and you can therefore have collisions between different repeaters. The only packets where this isn't the case are adverts, and those come so infrequently it didn't seem super useful for me to add a metric specifically for that (and you'd be better served by consulting the neighbors list of your repeater at that point). What this means is that if you have two repeaters with the same first byte (say, 0x11), they will both be shown as `node="11"` in the Prometheus metrics. If only one of them is directly connected to your origin_id, you might reasonably conclude "hey, it's that one", but if both are, well, then you're going to be averaging those metrics for SNR and RSSI. This is in the protocol, so there's not much I can change about it.
If you really want to get around it probably the best thing to do is coordinate with other repeater operators and make sure your "big" IDs (say, the main link between regions, or the repeater that serves a whole metro) stay as unique as possible. Do keep in mind that we're all amateurs here though and some people may not want to keep shuffling keys around just to make the metrics look nice.
