Endian Blog


Don't roll your own IoT Protocol

The pitfalls of IoT development

We’re all aware of how the surge of open-source has transformed the way we build software. Nowadays, most successful projects follow the same blueprint - by leveraging collaboration through open-source development, you minimize the time spent on the things that have little to do with your core business. Even if it was possible to write a smarter database engine or a more efficient serialization protocol than what’s already provided by community collaboration, odds are it won’t be worth your time. As a whole, this is well recognized by the software business.

Even the embedded Linux world seems to have caught on to this for the most part. However, this does not seem to be the case for the world of sensor nodes and other resource-constrained devices.

In the world of tiny microcontrollers, you still see a lot of business logic squeezed into copy-pasted vendor HALs, data serialized by memcpy’ing structs, actuation commands sent as JSON-encoded strings, and home-brewed encryption protocols leading to embarrassing posts on Hacker News.

Why is this? One factor is that, by tradition, firmware development has had more in common with hardware design than application development. The challenges faced by embedded developers in the past were perhaps not best solved by obsessing over architectural modularity and leveraging large-scale collaboration.

But while we still have to troubleshoot noisy clock signals and inexplicable watchdog resets, it’s not just voltages, registers and pins anymore. You’re also expected to somehow couple the ever-increasing complexity in application logic with stringent constraints on robustness, power consumption, network availability, latency and bandwidth efficiency. Oh, and by the way, we also expect you to support remote firmware upgrades, state-of-the-art encryption and a server-facing API that’s easy to work with.

It’s easy to not fully grasp the scope of this, especially since prototyping an IoT product is so incredibly simple - just solder a sensor breakout board to a devkit, wrap it in JSON and pipe the data to some hardcoded static IP over WiFi. If you think you’re 95% done at this point, perhaps it shouldn’t be surprising if you think you can “wing it” the rest of the way.

But as the project continues and the requirements become clearer, you’ll soon discover that the devil is in the details. There are endianness bugs in the data encoding. You spend hours in meetings with the cloud team discussing protocol details. Can you send the serial number only once per boot? Should we ack RPCs on reception or on completion? Do we need a heartbeat? How do we serialize that? The timestamp is in seconds since boot, so the cloud team insists you implement NTP. You realize too late that your SDK doesn’t support DNS. Or encryption. How do we upgrade the firmware? Can we have automatic rollbacks? Should we pause data uplinks while we’re upgrading?
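
To make the serialization pitfall concrete, here is a minimal C sketch (the struct and field names are made up for illustration) of why memcpy’ing a struct onto the wire is not portable, and what an explicit encoding looks like instead:

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    struct sample {
        uint32_t timestamp_s;   /* seconds since boot */
        uint16_t millivolts;    /* sensor reading     */
    };

    /* Naive approach: the layout depends on the MCU's endianness and the
     * compiler's padding, so the server may well decode garbage. */
    size_t encode_naive(const struct sample *s, uint8_t *buf)
    {
        memcpy(buf, s, sizeof(*s));
        return sizeof(*s);
    }

    /* Explicit approach: a fixed, documented wire format (big-endian here),
     * independent of whatever the compiler does with the struct. */
    size_t encode_portable(const struct sample *s, uint8_t *buf)
    {
        buf[0] = (uint8_t)(s->timestamp_s >> 24);
        buf[1] = (uint8_t)(s->timestamp_s >> 16);
        buf[2] = (uint8_t)(s->timestamp_s >> 8);
        buf[3] = (uint8_t)(s->timestamp_s);
        buf[4] = (uint8_t)(s->millivolts >> 8);
        buf[5] = (uint8_t)(s->millivolts);
        return 6;
    }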

While this picture may be somewhat too bleak, it hopefully serves to illustrate that there are many hidden pitfalls when it comes to architecture and protocol design. Pitfalls that take time to discover on your own - time much better spent improving the core business logic of your product.

The light at the end of the tunnel

It’s not that creating IoT products is impossible if you stick to the dated model of constantly reinventing the wheel. But we believe it’s possible to build products that are more efficient, more stable, more secure, more flexible, more featureful AND have a much shorter time to market by adopting a more modern approach.

Modern products are built using modern development principles: by maximizing the use of modular components, collaboratively developed and maintained, communicating with each other using APIs and protocols that are well-defined, open and proven by use.

Up until fairly recently, the lack of an active open-source community focusing on resource-constrained devices made this practically impossible. The modular components were mostly proprietary and closely tied to specific hardware and the popular open protocols were either unsuited for devices running on coin cell batteries or simply lacked the community traction required for a protocol to be considered well established.

The Linux Foundation realized that a unified community is a prerequisite for large-scale collaboration and so Zephyr RTOS was introduced. We never get tired of evangelizing about Zephyr - we believe it’s a truly transformative open source project that we will continue working with for many, many years to come. If you’re curious about Zephyr, drop us a line and we’ll be happy to chat!

In this article, however, we’d like to give you an introduction to another important landmark - the Lightweight Machine to Machine protocol (LwM2M).

LwM2M

LwM2M is a fairly recent IoT protocol mainly designed for device management. It is based on the Constrained Application Protocol (CoAP) over UDP which, loosely speaking, is designed as a drop-in replacement for HTTP in situations where TCP is either infeasible or otherwise undesirable. Support exists for other transports such as SMS and LoRa, but CoAP over UDP is the main use case.

For full disclosure, let’s make it clear that there’s nothing particularly brilliant about LwM2M - it offers an interface for client/server communications, agreeable addressing semantics for its object/resource model and not much else.

Of course, LwM2M includes many protocol features that are fantastic (bootstrapping and firmware upgrades), but they’re entirely optional, making the scope of the core protocol rather limited. But the simplicity is exactly what makes LwM2M exciting - simple APIs are easier to agree upon and it’s precisely such consensus that facilitates collaboration and reusability on a massive scale. Indeed, this is what we’re beginning to see with LwM2M.

Object-resource model

In LwM2M, we model a device as a collection of resources, where conceptually related resources may belong to the same object. The server addresses resources by their resource path (contained inside the CoAP URI) which takes the form

"Object[/ObjectInstance]/Resource[/ResourceInstance]"

where brackets denote an optional component - if the object/resource is single-instance, the instance ID is omitted.
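
As a concrete example, the registered IPSO Temperature object has ID 3303 and its Sensor Value resource has ID 5700, so the current temperature of the first object instance lives at "3303/0/5700". A hypothetical little helper in C (ours, not part of any LwM2M library) shows how little there is to the addressing scheme:

    #include <stdio.h>
    #include <stdlib.h>

    /* "3303/0/5700" -> object 3303 (Temperature), instance 0, resource 5700
     * (Sensor Value). Splits a path into up to four numeric components.     */
    static int parse_path(const char *path, unsigned int ids[4])
    {
        char *end;
        int n = 0;

        while (n < 4 && *path) {
            ids[n++] = (unsigned int)strtoul(path, &end, 10);
            if (*end == '\0')
                break;
            if (*end != '/')
                return -1;          /* malformed path */
            path = end + 1;
        }
        return n;                   /* number of components found */
    }

    int main(void)
    {
        unsigned int ids[4];
        int n = parse_path("3303/0/5700", ids);

        printf("%d components: object %u, instance %u, resource %u\n",
               n, ids[0], ids[1], ids[2]);
        return 0;
    }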

All resources support one or more operations - read, write, execute and delete, which correspond to the GET, PUT, POST and DELETE operations in CoAP.

Furthermore, the server can observe resources. For instance, we can ask for a sensor reading to be reported only when the value exceeds 2 V or deviates by more than 10 mV from the last reported value.
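
These thresholds are expressed as notification attributes that the server writes to the observed resource - "gt" (greater than) and "st" (step) in this example. A rough, non-normative sketch of the resulting device-side decision could look like this (consult the specification for the exact attribute semantics):

    #include <math.h>
    #include <stdbool.h>

    /* Notification attributes written by the server, e.g. gt=2.0 (volts)
     * and st=0.01 (volts) for the example above. */
    struct observe_attrs {
        double gt;   /* notify when the value rises above this threshold */
        double st;   /* notify when the value has changed by at least this step */
    };

    /* Rough sketch: should a new sample trigger a notification? */
    static bool should_notify(double value, double last_reported,
                              const struct observe_attrs *attrs)
    {
        if (last_reported <= attrs->gt && value > attrs->gt)
            return true;                     /* crossed the "gt" threshold */
        if (fabs(value - last_reported) >= attrs->st)
            return true;                     /* changed by more than "st"  */
        return false;
    }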

Firmware Over the Air (FOTA)

This is a simple but well-designed API that describes the update state machine as well as the details of the image transfer, which can be either “push” (this essentially gives the server write access to the update partition) or “pull”, which leverages the block-wise transfer mechanism already present in the CoAP specification.

When this API is coupled with a modern bootloader that supports automatic rollback, you get an incredibly robust firmware upgrade system that is essentially plug and play.
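
For reference, the state machine and both transfer modes live in the standard Firmware Update object (object ID 5). A short sketch of the resources involved (resource IDs as we recall them from the registry; the specification is the authoritative source):

    /* Firmware Update object (ID 5): "push" means the server writes the image
     * straight to Package (5/0/0), "pull" means it writes a URI to Package URI
     * (5/0/1) and the device fetches the image itself, typically using CoAP
     * block-wise transfer. */
    enum fw_update_resource {
        FW_RES_PACKAGE       = 0,  /* opaque: the image itself (push)            */
        FW_RES_PACKAGE_URI   = 1,  /* string: where to fetch the image (pull)    */
        FW_RES_UPDATE        = 2,  /* executable: apply the downloaded image     */
        FW_RES_STATE         = 3,  /* idle / downloading / downloaded / updating */
        FW_RES_UPDATE_RESULT = 5,  /* success or failure cause after the update  */
    };

In the pull case, the server simply writes the URI, observes the state resource, and executes the update resource once the image has been downloaded.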

Bootstrapping

The Bootstrap API is another optional (but very useful) feature of LwM2M. In short, it lets you provision your devices to connect to a bootstrap server instead of your “primary” server. The bootstrap server can then provide credentials and configurations based on device-specific information, such as location, serial number or device type. The device then disconnects from the bootstrap server and connects to the primary server, using the credentials and configuration obtained from the previous step.

While this may sound slightly roundabout, it turns out to be quite useful. It can greatly simplify provisioning if the devices are to be shipped to many different countries, or used by many different customers who each have their own cloud, because the devices only need to be provided with the credentials for the bootstrap server during production. The “true” server address does not need to be decided until the time of first deployment, which can potentially be much later.
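
In other words, the only thing that has to be written to the device in production is the information needed to reach the bootstrap server; everything else is filled in during bootstrap. A hedged sketch of what that split might look like (the struct and field names are ours, the URI is a placeholder; conceptually the data ends up in the LwM2M Security and Server objects):

    #include <stdint.h>

    struct provisioned_config {
        /* Written once in production - identical for every unit: */
        char     bootstrap_uri[64];   /* e.g. "coaps://bootstrap.example.com" */
        uint8_t  bootstrap_psk[16];   /* credentials for the bootstrap server */

        /* Written by the bootstrap server at first deployment: */
        char     server_uri[64];      /* the customer's "real" LwM2M server   */
        uint8_t  server_psk[16];
        uint16_t short_server_id;
        uint32_t lifetime_s;          /* registration lifetime                */
    };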

Wrapping up

One key benefit of adopting a widely used protocol is that details such as data encoding, timestamp format, state synchronization, heartbeats, etc. are already ironed out. You don’t have to think about how to serialize RPC calls, or how to ACK them. It’s all in the specification.

Another key benefit is the interoperability enabled by the OMA object registry, which contains definitions for thousands of common use cases such as light switches, accelerometers and e-ink displays. The registry also allows companies to register new custom objects at no charge. This means that any LwM2M device can connect to any LwM2M server and immediately provide full read/write/exec access to all its resources - no product-specific schemas or config files needed.

Of course, interoperability also means that many different types of devices can connect seamlessly to the same server, which can streamline operations considerably for companies with diverse device fleets. Think about that for a second - how many dev hours would it currently take to fully integrate an entirely new line of devices into your cloud solution?

It also helps that high quality, open-source reference implementations for both device and server are readily available. Setting up a demo server with a fully functional device management UI literally takes minutes.

We’ve only mentioned a few of the features provided by the LwM2M specification, so in a way we’re just scratching the surface of the LwM2M protocol. On the other hand, there’s conceptually very little about LwM2M to “get”. LwM2M is simply telling you: “Hi, here’s a common API. If you adhere to it, you will in return get access to the work done by everyone who is also adhering to this API”. It’s nothing magic - it’s just an invite to collaborate.

We’re still in the early stages of adoption (the protocol was officially released in 2017), but it’s already very much suited for industrialization, as we’ve most recently shown by helping Voi deploy it to thousands of their scooters world-wide.

If we’ve managed to pique your interest, don’t hesitate to give us a call, or drop us an e-mail!


LoRaWAN - Improved Range for the TTN Gateway at the Endian Office

LoRa/LoRaWAN is a wireless technology and protocol stack that is often used in the IoT domain when low power consumption and small, infrequent packets align with the use case. It is not a ‘Silver Bullet’ for all needs, but a technology that supplements other IoT links like SigFox, LTE-M and NB-IoT.
At Endian we started to work with LoRa and LoRaWAN in 2016. At that time we also launched our first LoRaWAN gateway connected to The Things Network. In order to improve the range and support the TTN community, we have now migrated to a new gateway and moved our antenna to the roof. This has improved the range substantially! The Endian TTN Gateway is located at Flöjelbergsgatan 11 in Mölndal. Check ttnmapper if you want to see the current status!
Today we see LoRa not only in IoT applications, but also in other domains such as long-range drone control and aspiring mesh applications like Meshtastic.
Exciting times ahead!


Real Internet on an RTOS

Our engineers here at Endian Technologies AB have decades of experience with real-time embedded systems. During the past couple of years we have become experts on Zephyr, a real-time operating system for secure IoT devices. In this article I’d like to highlight a major improvement in its networking drivers and what it means for our future projects.

Say you’re building an IoT device. How does your device get on the Internet? What technology you choose depends on your application. Maybe your application is coupled with a gateway and you can use 6LoWPAN via BLE. Perhaps you even need cabled Internet and you use an Ethernet controller.

These technologies are good, but they are not always right. Sometimes you need cellular or Wi-Fi. Suppose you decide to use a cellular modem with LTE. If your project uses a general purpose OS like Debian then things are pretty easy to get working.

What if your device is so small that it can’t run a general purpose OS? Generally this has meant that your project would be a second-class citizen when it comes to Internet access. You have had to rely on vendor-specific AT commands.

AT vendor commands for networking

To be very concrete, let us say your project is using a Nordic nRF52840 SoC and a cellular NB-IoT modem. Your firmware uses Zephyr and runs on the SoC, which communicates with the modem through a UART. How do you get on the Internet with this combination?

Let’s say you’re pretty green to this cellular modem thing. You would open up the documentation for your modem and find three things within easy reach:

  • Configure an APN (AT+CGDCONT) for the data connection
  • Vendor-specific AT commands for HTTP, FTP, etc
  • Vendor-specific AT commands to open and use TCP and UDP sockets

You do some experiments in minicom and the commands appear to work. With this in mind you get going designing how your device will communicate with the network. You wanted to use CoAP over DTLS but it turns out the modem doesn’t support that. You need data to be encrypted on the network, so you grudgingly decide to use classic HTTP and TLS. Besides, the server guys are already familiar with it, so everyone is happy.

This appears to be a good solution and you start writing the library that will talk with the AT vendor commands for HTTP. After working way too long on this, you finally get the right contact person for your cellular operator and they tell you that you shouldn’t use TCP on NB-IoT. They don’t recommend it at all. Oops!

Your mind is looking for a way out and finds it: Zephyr has an Internet stack and you can adapt it to work with the AT vendor commands for sockets. Luckily for you, you find that Zephyr already has drivers like these, called socket offloading drivers. You decide to use DTLS with a socket offloading driver. It quickly turns out that this combination is not supported since the modem doesn’t support DTLS. But we were going to use Zephyr’s network stack? Turns out the offloading drivers recently went through a change where they hook into the network stack in such a way that Zephyr’s own DTLS/TLS support is not used. Oops!

Real Internet protocols to the rescue

Things are looking pretty desperate for you. You need encryption, but the modem doesn’t support DTLS, which your RTOS does support, but your RTOS doesn’t have a driver that supports DTLS in combination with your modem. TCP is not recommended on NB-IoT, but you’ve meanwhile found out that even if it worked it would mean that your data is sent in clear text on the modem’s UART.

Why are things not this bad on Linux? If your application was using Linux then you wouldn’t have to go anywhere near AT vendor commands. Why not? Because you would use PPP. PPP is like having running water in your house. Networking with AT vendor commands is like bringing water in buckets.

Zephyr 2.0 (September 2019) added support for PPP. Zephyr 2.2 (March 2020) added a GSM modem driver that uses PPP. Zephyr 2.3, which is not yet released, adds GSM 07.10 multiplexing. Much of this work was done by Jukka Rissanen at Intel. It is difficult to convey how significant and important this work has been.

PPP in combination with GSM 07.10 lets Zephyr use cellular modems in just the same way that Linux uses them. The GSM 07.10 protocol provides multiple virtual UARTs over a single physical UART and makes it possible to use AT commands while PPP is up and running. PPP is a full duplex serial protocol with framing and checksumming. It has a control protocol for the serial link itself (LCP) and network control protocols (NCPs) for negotiating Internet addresses and DNS (IPCP and IPV6CP).
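
The practical consequence is that the application talks to an ordinary BSD-style socket, just as it would on Linux, while the PPP and GSM MUX layers deal with the modem. A minimal sketch (assuming a Zephyr 2.x-era header path and CONFIG_NET_SOCKETS_POSIX_NAMES=y; treat the details as approximate):

    #include <stdint.h>
    #include <string.h>
    #include <net/socket.h>   /* Zephyr 2.x-era header path */

    static int send_reading(const char *server_ip, uint16_t port,
                            const void *payload, size_t len)
    {
        struct sockaddr_in addr = {
            .sin_family = AF_INET,
            .sin_port   = htons(port),
        };
        int sock, ret;

        inet_pton(AF_INET, server_ip, &addr.sin_addr);

        sock = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);
        if (sock < 0)
            return sock;

        /* From here on this behaves just like a socket on Linux - the fact
         * that the bytes end up as PPP frames on a modem UART is invisible. */
        ret = sendto(sock, payload, len, 0,
                     (struct sockaddr *)&addr, sizeof(addr));
        close(sock);
        return ret;
    }

Swap the UDP socket for TCP, or put DTLS on top using the RTOS’s own TLS stack - the point is that the choice is now yours rather than the modem vendor’s.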

Here is a comparison:

  • PPP: full duplex – AT commands: half duplex
  • PPP: framing – AT commands: line-oriented ASCII protocol
  • PPP: checksums – AT commands: no checksums
  • PPP: no waits when transmitting – AT commands: wait for a transmit prompt
  • PPP: DTLS/TLS handled in the RTOS – AT commands: modem handles TLS keys
  • PPP: no application plaintext on the UART – AT commands: plaintext on the UART
  • PPP: supports all Internet protocols – AT commands: support select protocols
  • PPP: standard protocol with RFCs – AT commands: vendor-specific protocol

PPP is better in every way. There is of course some place in the world for networking via AT commands. If you’re using a PIC processor that can’t run Zephyr then they might be your only option. But if you have any chance at all to use Zephyr, then you’re better off with PPP.

I’ve built applications in the past using AT vendor commands, but those days are gone now that Zephyr has PPP support.

AT commands and TCP

Before finishing I would like to point out one particularly bad combination of protocols. Here it is: TCP over AT commands.

TCP/IP has built-in checksums, retransmissions and flow control. TCP over AT commands is just not TCP. It might be used to simulate a remote serial port, but it can’t carry any serious amount of data or communicate reliably with a real Internet server.

Bit errors and dropped bytes on serial lines are a common occurrence. With AT commands, a dropped byte on the UART results in a dropped byte on the TCP connection. No applications are written to work when a byte is lost on a TCP connection. When using a real network stack those types of errors are corrected by the network layer before they reach the application. But with AT commands there is no chance to correct the error. By the time the byte has gone missing, the modem has already sent an acknowledgement to the server and there is no way to correct the error.

TLS over such a lossy TCP connection is just not viable. Any error on the UART results in a fatal error that breaks the TLS connection. There is some theoretical hope for DTLS over UDP over AT commands, which would work because DTLS does its own checksumming and handles lost packets. But TCP over AT? Don’t even bother trying.

Summary

In a contest between how we used to do it (AT commands) and how we’re doing it now (PPP and GSM 07.10), the new way wins every time. Where we’ve been using PPP, the network problems experienced during development haven’t just diminished by some factor; they have completely disappeared.


Report from FOSDEM

I visited FOSDEM in the beginning of February. For those who don’t know, FOSDEM is the largest free software conference in Europe, attracting more than 8000 enthusiasts and hackers from all over the world. The conference requires no registration and is held on a university campus in Brussels.

This year I didn’t have a clear strategy or focus for the talks I wanted to see. The number of talks and development rooms usually requires a strategy - popular rooms become crowded really fast and there is usually a wait outside. So to see a talk that begins at 13:00, one sometimes has to be outside the room at 11:00. The upside of this is that one gets to see talks one didn’t plan to, which is usually a refreshing experience.

I usually tell people going to FOSDEM for the first time that, if they are unwilling to figure out a strategy, they should just go to one of the two largest rooms when they have nothing else planned. Those rooms (Janson and K105) are always home to relatively general topics and/or keynote tracks. They are also large enough to provide both room and oxygen for everyone. As I didn’t have a strategy myself, that is how I ended up spending my FOSDEM, apart from food, visiting various stands, buying t-shirts and talking to people I don’t usually meet outside of FOSDEM.

Apart from the more general talks I saw in the large halls, I particularly enjoyed a talk about the uselessness of end-to-end encryption in messaging apps, from the user’s perspective, given by an XMPP developer on Saturday. On Sunday, there was a big talk where the Matrix developers bragged about how great end-to-end encryption is in Matrix. While the Matrix developers acknowledged the weaknesses highlighted by the XMPP developer, their enthusiasm felt rather unwarranted, given that the main point in the earlier talk was that end-to-end encryption only benefits the server operator.

In the end, FOSDEM was as enjoyable as always, and I got myself a new pair of GNOME socks and a t-shirt, which is all I really wanted from the trip :-)


Endian ❀ Zephyr RTOS: An Introduction

Building and bringing connectivity to embedded Linux devices is kind of Endian’s thing - we’ve been doing it since the company started back in 2003.

Linux is a great choice for any device that can support it - it’s free, has great hardware support, it’s incredibly feature-rich and has an enormous, dedicated developer community that are constantly optimizing performance, adding features and squashing bugs. If your device can run Linux, it probably should.

But tiny IoT devices that want to operate for years on a coin-cell battery can’t. Linux can be made quite tiny, but it won’t ever be tiny enough to run on an MCU with 16 kB RAM and 128 kB flash.

On these devices, the traditional solution means choosing some proprietary kernel which you then customize to your use case with home-brewed or copy-pasted code. But the requirements on modern IoT devices have made this approach infeasible. If you want your connected device to be secure, robust, performant and power efficient, you need modern development principles: community collaboration, well-designed abstractions, modularity.

The Zephyr Project

Enter Zephyr RTOS - a project that aims to do for tiny IoT devices what Linux has done for the rest of the embedded world. The permissive licensing model, active community, feature richness, focus on security and robustness, and wide hardware support make Zephyr a great choice for modern IoT development.

We have already used Zephyr for several projects, including (the world’s first?) 6LoWPAN-connected EV charging station; a solar-powered, camera-equipped recycling bin; an award-winning self-powered flow sensor and a smart lock that uses NFC and BLE.

These products are incredibly complex, but also have strict requirements regarding encrypted communication, ultra-low power consumption and device-to-cloud interoperability. It’s true that these constraints could be satisfied using just about any RTOS, but the time-to-market likely wouldn’t even be of the same order of magnitude as with Zephyr.

If you’re curious about what Zephyr can do for your company - drop us a line or give us a call. We’re always happy to help.


Report from the All Systems Go conference

At the end of September (2019) I visited the All Systems Go conference. The official slogan is “The open source community #conference focused on foundational user-space #Linux technologies”, in other words what is often referred to as “Linux Plumbing”.

My impression is that it doesn’t focus on a specific industry (unlike, for example, the Embedded Linux Conference, where you can also find a lot of plumbing-related discussions), but of course things might get a bit skewed if presenters are working on similar things. Companies that are heavily invested in this include, for example, Facebook and Kinvolk, both with a cloud and container focus. That might not be my personal biggest interest, but it’s always interesting to get a different perspective and also to think about which parts can be reused for other purposes. The general feeling was that this year can be summarized as “containers, containers, containers and (e)BPF”.

Listening in on anything related to systemd is of course always interesting, and this year the most interesting and potentially controversial talk might be the one about systemd-homed and the visions for it.

Every time I go to a conference I always try to go to one “wildcard” speech. I try to find an unallocated slot in my personal schedule where I can fit a talk about something that I might not be particularly interested in, but that has a crazy enough description to make you wonder how it even fits in. This year my wildcard choice became the GNU Poke talk, which was awesome and has since been hyped by people like GregKH and described as the most impressive presentation ever seen! It’s always a great feeling listening to a talk where the presenter is both humble and very excited about their work.

There was also a great presentation by the same person on BPF in the GNU toolchain, and another talk from Facebook also related to BPF (and systemd): https://media.ccc.de/v/ASG2019-144-custom-cgroup-bpf-programs-in-systemd

The talk I brought home to my fellow embedded-interested colleagues was the great one about bringing up an STM32 with only free tools, which was both a good introduction to the tools and a deep dive into some details of how the STM32 works.

Lots of work still seems to be needed to bring Docker kicking and screaming into the brave new cgroups2 world.

Some other notable talks I went to that you might want to look at if you want to know the latest state of things in that area:

Other videos are available at https://media.ccc.de/c/asg2019 and on Youtube. More info also available on the ASGConf twitter.

