The weirdest QNX bug I've encountered (2021)

175 points by fanf2 a year ago

Animats a year ago

"At this point, an intermezzo with some QNX history is in order. A bit more than a decade ago, the QNX source code was available to the public. Back then, QNX had a vibrant open source community. People would experiment with the kernel, write various useful utilities and help each other in forums. QNX even had a fully featured Desktop GUI, ran Firefox and was self-hosting, so you could develop for QNX right on QNX itself with full IDE and compiler support. It was beautiful."

"Then QNX was bought, source code access was revoked and the community largely withered away. Questions were increasingly asked via private support tickets directly to QNX, locked away from the public. QNX know-how becomes harder and harder to acquire, open source software for modern QNX releases is essentially non-existent and the driver situation is a catastrophe. The QNX kernel is the most beautiful and interesting kernel I have ever had the pleasure of working with, but it lies in the shackles of corporate ownership."

It's sad.

QNX was originally an independent company. During that period, anyone could get a free copy of QNX for personal use. It wasn't open source, but it was available. It's POSIX-compatible, so it was a supported target for Gnu, Firefox, and Eclipse. We used QNX for our DARPA Grand Challenge vehicle in 2003-2005, and all that code was developed on desktop QNX.

Then QNX was acquired by Harmon, the successor to Harmon-Kardon, which once made home audio components and pivoted to car audio. They were thinking car infotainment. Harmon didn't really know what to do with an operating system, especially since the big market was systems for industrial control and point of sale. So eventually they opened the source.

Then QNX was acquired by Blackberry, the early smartphone company. They closed the source, very suddenly. They even killed off the free version for personal and educational use. So all third party open source development stopped. Blackberry eventually shipped a phone that ran QNX, but they were not powerful enough as a company to keep a third phone standard going. So Blackberry went to Android.

Blackberry killed off the self-hosted desktop environment, and users now had to cross-compile from Windows.

And QNX became more of a niche product than ever.

akira2501 a year ago

> but they were not powerful enough as a company to keep a third phone standard going.
They absolutely were, which is the tragedy of the whole thing, people absolutely loved their products and strongly preferred them to everything else on the market.
Instead of recognizing the game changer that the iPhone was they slept on the market and didn't do much to bring big touch screens and rich internet applications to their platform.
It was a slow and agonizing death.
- jasoneckert a year ago
  
  They may have been powerful enough in size and monetary resources, but not in process, structure, or vision. RIM (later BlackBerry) was a tech company that wasn't really run like a tech company - it was run like an insurance company, with a full fledged bureaucracy governed by their legal department. And as a result, when competition from Apple and Google came, they couldn't pivot to compete beyond a snail's pace and faded into the sunset very quickly.
  - tejohnso a year ago
    
    I don't think they had much of a chance with that level of competition. Apple was able to completely shift the market from RIM's strength, which was their full physical keyboard. The screen was for basic productivity applications, and messenger, not for input. There was no app ecosystem. It took years before Blackberry came out with a touchscreen version and it was lackluster. By that time most people who weren't diehard RIM keyboard lovers had already switched anyway.
  - Varriount a year ago
    
    Do you know of any good sources to read about this (the internals of RIM/BlackBerry and its downfall)?
    
    ExoticPearTree a year ago
    
    There's a pretty nice movie about Blackberry's rise and fall: https://en.wikipedia.org/wiki/BlackBerry_(film)
    
    jasoneckert a year ago
    
    There's also a great book on it too: https://www.amazon.ca/Losing-Signal-Spectacular-Rise-BlackBe...
- asveikau a year ago
  
  > Instead of recognizing the game changer that the iPhone was they slept on the market
  But they made that error long before buying QNX. By the time they bought QNX they had probably lost too much ground to turn it around.
  - Tsarbomb a year ago
    
    Hard disagree. While they had lost a ton of ground people were still sporting blackberries in large numbers. The absolute bone headed move they did is they continued to promote and sell their legacy models front and centre cannibalizing their own future growth.
    There were other issues that were much smaller in the bigger picture like android app support on BB10 coming a little late as well as the devices in general had a somewhat underpowered SoC. All of these contributed to slow adoption, but the fact they were promoting new models of the Bold while their warehouses were full of the Z10 was really what did them in.
    
    asveikau a year ago
    
    > The absolute bone headed move they did is they continued to promote and sell their legacy models front and centre cannibalizing their own future growth.
    That's kind of exactly what I meant, though.
    Nokia made an almost identical mistake, before the Microsoft acquisition, by not doing a timely migration path from Symbian to Meego. The old business was a cash cow so they kept it going, not taking the new thing seriously.
    
    justsomehnguy a year ago
    
    > promoting new models of the Bold while their warehouses were full of the Z10
    Not familiar with BB lineage I checked the Wiki for the release dates and ... wow.
    Not quite an offtopic: people bash me when I point out what Burning Platform is the result of the previous Nokia shenanigans, so blaming it all on Elop is like blaming the fire for charring your steak you forgot on the BBQ.
- lproven a year ago
  
  > They absolutely were, which is the tragedy of the whole thing, people absolutely loved their products and strongly preferred them to everything else on the market.
  Concur. I had a Passport and it was a deal-changing sort of device.
  The universal inbox was amazing technology, which totally reworked how a pocket comms device worked. Sadly as companies gradually dropped BB10 support, although you could run Android messaging clients, those didn't integrate with the BB10 inbox, so gradually the phone got less and less useful.
  This should have been a reason to demand open comms protocols and legally require all comms-tools vendors to support 3rd party clients.
  Today, I use Ferdium to connect almost all my comms apps into one client app, but it's just a bloated Electron thing that forcibly unifies dissimilar web apps. It's clever but it's not really a solution. It's a herd of elephants on a seesaw in order to crack a peanut.
  Never mind laws about cookie banners, it should be a procurement requirement for all government agencies in the free world that all vendors must support a documented protocol so that native client apps can connect.
  To Slack, Whatsapp, Signal, Telegram, Google $ChatAppOfTheMonth, MS Skype|Teams|Whatever.
  No web apps, no Javascript, just mandatory lowest-common-denominator local rich clients, so that tools like BB10 could talk to everything.
  Frankly I don't care if it doesn't support sound or video. I will go quite far to avoid that anyway. But text, basic formatting, smilies, maybe embedded static image files and attachments.
  Nobody would lose from this.
- whitten a year ago
  
  Weren't Blackberry phones almost a requirement for USA government computers ?
fouc a year ago

I've always thought QNX 4.24 w/ Photon microGUI should've been fully open sourced, even a decade later. It would've been competitive with linux in the desktop OS arena.
grishka a year ago

> They closed the source, very suddenly.
I'm very surprised that there aren't any projects based on the last available open-source version. This sort of thing usually happens when the company behind an open-source project that has a large community does something stupid to the project. Mapbox is one such example.
- rcxdude a year ago
  
  It was never actually open-source: it was only source-available. So when they stopped making it available there was no legal basis for continuing. (this is why the definition of open source matters!)
tecleandor a year ago

Sorry for the pettiness, but Harman, not Harmon ;)
kragen a year ago

it's unfortunate that qnx wasn't open-source; revoking source code access would have been impossible
- Animats a year ago
  
  It was, for a while. Blackberry did revoke QNX source code access. The original poster had a copy around from the open source era, and thus was able to fix "ps".
  I once told a QNX sales rep that their problem was not being pirated. It was being ignored. Today, I'd say "forgotten".
  - kragen a year ago
    
    no, it never was. the qnx source code was available, but it wasn't under an open-source license. if it were, given the number of fans it had in its heyday, there'd be at least one live qnx fork
    https://github.com/vocho/openqnx/blob/master/trunk/lib/c/1/o... quotes the license as follows
    > You must obtain a written license from and pay applicable license fees to QNX Software Systems before you may reproduce, modify or distribute this software, or any work that includes all or part of this software. Free development licenses are available for evaluation and non-commercial purposes. For more information visit http://licensing.qnx.com or email licensing@qnx.com.
    
    Animats a year ago
    
    Right, it was never under an open source license, but you could look at the code.
    Re-implementing the QNX 6 kernel in Rust would have been a nice project. It's only about 60 kilobytes of code.
    All the kernel does is pass messages around, dispatch the CPU, and run timers. All device drivers are in user space. You can build a boot image with whatever processes you want running at startup, so you can have device drivers at boot.
    For smaller embedded applications, everything might be loaded at boot. You don't have to have a disk or file systems. There are embedded real-time applications where having zero persistent state is desirable.
mavhc a year ago

The only way a second standard can survive is by being open source. A 3rd? Very unlikely
- whitten a year ago
  
  iOS isn't open source. Is Android by Google enough open source ? Or are they not the two standards you are thinking about ?
  - fragmede a year ago
    
    > Is Android by Google enough open source ?
    Android is open source, leading there to be several Android-based distros like postmarketOS and popOS. they are able to take advantage of the android software ecosystem and so are able to have apps you'd actually want to run on them
    
    BSDobelix a year ago
    
    Non of these two are android based, but DivestOS or LineageOS are a fork's of AOSP.
    https://en.wikipedia.org/wiki/List_of_custom_Android_distrib...
    
    m4rtink a year ago
    
    I don't think PostmarketOS or PopOS can be considered to be an Android ditro - they are IMHO traditional Linux distros targeting mobile devices.
    
    lproven a year ago
    
    postmarketOS is based on Alpine Linux.
    Pop!_OS is based on Ubuntu.
    There are lots of AOSP-based mobile OSes, including LineageOS, Graphene, Amazon's Fire OS, CalyxOS and /e/ OS.
    https://en.wikipedia.org/wiki/List_of_custom_Android_distrib...
    But neither of yours are.
    
    fragmede a year ago
    
    Thanks, I couldn't be bothered to look it up and I knew Cunningham's law would kick in. :)
  - mavhc a year ago
    
    Google made Android the 2nd standard phone OS by offering it to everyone as open source

arsome a year ago

I actually ran across this issue myself, SIGQUIT'd the process, loaded it into a debugger and found the exact same problem. I can confirm the problem still exists on QNX 7.1. Fortunately we were moving off it, so I didn't think much more about it, but glad someone wrote it up.

nrclark a year ago

QNX really needs to modernize if they want to survive. Their tooling ecosystem is stuck in 2008, and their kernel's performance is pretty low. IIRC, the kernel itself is also single-threaded, and can't take advantage of multiple CPUs (even if tasks can be SMP scheduled).

Their moat is supposedly their ASIL certification, but I see that value shrinking more and more over time for the following reasons:

1. If your product has a software-related failure, customers won't care about all of your certifications. Only the end product.

2. I'm not convinced that the QNX kernel is less buggy than the Linux kernel. Also, most failures don't tend to be kernel related.

burstmode a year ago

>If your product has a software-related failure, customers won't care about all of your certifications. Only the end product.
If you're in a market where a ASIL certification is needed, the customers ONLY care about this certifications. I keeps them out of jail.
- nrclark a year ago
  
  Can you point me at some more detailed rules that support your assertion? Not trying to argue - I’m actually interested to read more details on that.
  - rwmj a year ago
    
    It's the reason why some companies, like IBM [disclosure: I work for Red Hat], seem to sell products even though there seems to be little rational reason why customers would buy them, as in they have poorer performance or quality at a much greater price. Those products are certified against dozens of financial, safety, security or other standards, and customers in certain markets (government, military, nuclear, automotive etc) simply have to buy the certified products. The consequences of not doing so range from products not being supported, all the way to going to jail for gross negligence.
    Edit: I wrote a rather highly rated HN comment about why Red Hat makes money last year: https://news.ycombinator.com/item?id=35588297
    
    jeffrallen a year ago
    
    Another example of this is FIPS-140 crypto. It is objectively bad crypto in the 2020's. But it's mandated in some settings for either bureaucratic reasons or due to regulatory capture.
  - throwaway173738 a year ago
    
    It’s not really a rule, but rather in some environments you have to be able to say in court that you did everything you could to make sure your software worked safely and correctly. Sometimes you will be risking criminal charges if you can’t.
  - f1shy a year ago
    
    The truth is, too many managers have never read the ISO document, and follow the CYA methodology, and ask for everything to be certified. The ISO just says (bare with me with this stupid simplification) “do whatever you want, but make sure p(disaster)<1e-20. You have to be able to justify decisions, but will not helt having certified frameworks, os, and tools, if you did a bad FMEDA
    
    szundi a year ago
    
    Following this logic it seems to be a good choice to buy RHEL because you have no chance running linux with those probability margins that you just wrote. Electronic components might have those. So stay out of jail
- f1shy a year ago
  
  There is NO market where “ASIL” is required. Of course if something happens you better have a safety case as described in the ISO26262, or a good excuse. That being said, that a system has a safety case according to ISO26262 ASIL D, does bot mean at all that all pieces must be certified.
  Currently working in a project where ASIL D is reached by having an independent microcontroller, whatching out the whole QM mess.
  - foooorsyth a year ago
    
    >There is NO market where “ASIL” is required
    Define “required”. If every single legal department at every single major automotive company says “we must obtain ASIL-B certification for our gauge cluster software or we can’t sell cars”, does it matter if regulators don’t overtly mandate it? The legal environments of all of the major automotive markets make it a de facto requirement.
    
    f1shy a year ago
    
    The ISO26262 was defined by the automakers themselves (almost all were represented in the committee) so yes, they want to follow it. There is no legal requirement. It does not specially help in case of litigation either.
  - rcxdude a year ago
    
    It's not legally mandated, but the dynamics of the regulation and the risk-averse nature of companies mean that it's effectively become a requirement to compete: if you don't have it, you're only going to sell to the rare company that is willing to stick their neck out and deal with novel arguments in the paperwork themselves. For commercial aerospace that is none of the manufacturers.
    (someone else might come along and certify it themselves, effectively acting as a middleman, but then they're going to get most of the money)
cbsks a year ago

QNX Muon removed the global kernel lock so it has much better multi-core performance.
- jeffreygoesto a year ago
  
  https://www.qnx.com/developers/articles/rel_7063_0.html#what...
Beretta_Vexee a year ago

There is a very large installed base of machines running on QNX, including medical equipment, radio communications, railway switches, automotive modules, etc. Most manufacturers are upgrading in small steps and absolutely do not want to start from scratch with their software layer. Most manufacturers are upgrading in small steps and have no desire to start from scratch with their software layer. At best, they will add a touch screen, an arm processor and an Android system, but only for the interface. The critical parts will remain under QNX. The same is true for VxWorks, a large proportion of industrial PLCs, electrical networks, water and sewage systems depend on VxWorks. QNX and VxWorks are omnipresent but invisible systems.
foundry27 a year ago

What kind of tooling modernization would even make a difference?
At least SDP 8.0 overhauled the kernel to not be locked to a single thread anymore, which is nice IMO
agustamir a year ago

> kernel's performance is pretty low
Can you elaborate on this? How is it "low"
- NoahKAndrews a year ago
  
  They did, in the next sentence

bxparks a year ago

I counted 417 comments on that page and scrolled through a few dozen. Every one of them was spam. That's pretty much the internet these days isn't it.

Other than that, the blog post was very interesting, I learned a bit of history of QNX, and concluded that I should avoid it.

OnionBlender a year ago

Is it though? I don't remember the last time I saw so many spam messages. Most sites I visit do a better job of preventing or removing spam.
ronsor a year ago

The spambots are literally having conversations with each other.

the_panopticon a year ago

I recall trying to debug a crash in QNX during the mid-90's. I was impressed by the svelte OS that could load from a 3.5" floppy. The failure scenario was coincident with one of my first tasks as a BIOS engineer and it entailed adding some custom error logging in System Management Mode (SMM). Luckily for me it turned out that I had forgotten to save/restore certain general purpose registers around my SMM logic. Fun times. SMM is pretty good at 'breaking' operating systems :)

ragnot a year ago

Every developer I know (myself included) that has worked with QNX has a story about some insane bug that took significant effort to uncover. At this point, I would say the only reason one should look at QNX is for cost since it is pretty cheap. The low jitter on context-switching to the highest priority thread is a nice thing but the dev process is absolute garbage.

fargle a year ago

yep. it's really trash. used it 20-25 years ago when they were just introducing "neutrino" vs. the classic QNX4. the former had a good rep with auto and medical usage.
- very bad port of GCC was buggy and generated bad code. it was the result of some idiot blindly haphazardly applying a ton random incoherent patches to try to get it to build instead of porting it properly. (to be clear mainline GCC at that time was fine; we re-ported it ourselves instead) - and, of course, they used their own faulty compiler to build their libraries and services ;) causing unknown carnage waiting to be discovered.
- malloc broken (heavy use under multiple threads causes heap corruption). replaced with dlmalloc.
- serial port driver broken. rewrote a new one.
- intel network card driver crashes. replaced hardware with 3com to survive.
- certain math library functions broken (iirc, fmod). replaced.
and so on and on.
it doesn't matter what certification your RTOS (or whatever) has. if you cannot examine the source, rebuild it, etc. (oss or private source available) - it cannot be trusted. this was one of the worse examples, but it's always like this with "proprietary" OS/toolkits.
nubinetwork a year ago

When you can tell qnx to give you a root shell as an unprivileged user, ps being slow doesn't surprise me... https://www.juliandunn.net/2006/08/21/on-hacking-the-unisys-...

banish-m4 a year ago

What I like about seL4 (although not a complete embedded dev platform) is formally-verification. QNX might have EAL4 in some configurations, but like most every other operating system on the planet, they haven't bothered to up their game by formally verifying it for correctness. This is a shame and entirely preventable with greater attention to testing and verification.

kragen a year ago

what sel4 shows is that it's entirely preventable with a rewrite from scratch by a team of formal-methods ph.d.s over many years, if they invent a design that allows the kernel to only be a few thousand lines of code so that the titanic effort of formal verification becomes feasible, barely. it's not something you can do with a codebase of millions of lines of code or a codebase that wasn't written from scratch with formal verification in mind. yet, anyway

redtriumph a year ago

https://news.ycombinator.com/item?id=29207216

lfkdev a year ago

What is going on with the comment section on this post?

ralferoo a year ago

I actually thought some of the comments were funny. Especially the one about the crab in the shell! No idea why they thought it was related to QNX, but an insight into the mind of spammers nonetheless.
- exikyut a year ago
  
  That particular snippet was posted four times if you ^F.
  It's really interesting to see the current "state of the art" in terms of the types of bots that get past the particular CAPTCHA implementation this site uses.
  It's a very rudimentary type of CAPTCHA, the kind that anything developed within the past 5 years would probably get past with at least 30% accuracy (logarithmically skyrocketing to >90% within the past ~2 years).
  So the post quality is somewhat distributed across a spectrum - on one end, dumb CAPTCHA OCR/processing <=> "slightly better than Markov chain model replication", and at the other end, clearly more sophisticated systems that more easily pass the CAPTCHA and generate more interesting posts.
  What's curious is that 90% of the comments are rudimentary. There are very few interesting spam posts. I'm trying to figure out what to make of this.
  I'm picturing some majority of utterly outdated spambot infra, still out there, scanning the Web for WordPress/XSS-level stuff, and finding success on blogs like these... and that these old bots are the only systems of their kind out there, because all the spammers collectively gave up with reCAPTCHA and CloudFlare protecting almost all meaningful concerns, with moderation following not too far behind.
  Kind of makes sense.
  But it's really depressing to compare these old clunky bots that are kind of cute (in a way) to the upgraded versions - the current-era tech, that get past moderation... and effectively pass the Turing test :'(
  - ralferoo a year ago
    
    I hadn't noticed multiple copies of the crab one, but had of various others. To me the fact there are so many duplicates makes me think those texts, if not written by humans, were selected by humans.
    If it was truly AI-generated, I'd expect a random seed as part of its input, and unless there was very little entropy, I'm not sure it would chance upon the same exact formulations over and over. Maybe they hadn't tested the randomness aspect well in the training and it'd learned not to attach much weight to that beyond the first word or two.
  - phito a year ago
    
    > What's curious is that 90% of the comments are rudimentary. There are very few interesting spam posts. I'm trying to figure out what to make of this.
    They mostly make positive comments about the website/article/author to have less chances of being deleting. The end goal being linking their website to improve SEO.
  - fragmede a year ago
    
    the problem with captchas is that spammers will just pay a human in a very-low cost of living country to actually do them.
0l a year ago

Link farming, check out the usernames - they are links to third party sites in an attempt to fool search engine rankings.
Jolter a year ago

Clearly they are not implementing good bot protection. The results are not very surprising IMO.

tfrutuoso a year ago

Great article, but the comments section of that blog is pure cancer. Jeez.

torginus a year ago

Honestly this kinda shows me that no matter what degree of robustness we design into our systems (null saftety, memory safety, thread safety etc.), some types of system breaking bugs are unavoidable (such as DOSing the system by calling a system API function in an infinite loop), and are often impossible to distingushing from desired behavior.

KerrAvon a year ago

Being able to DOS the system by calling an API should be considered a bug in the system. There are mitigation strategies for such things. If you can’t (for mitigation purposes) distinguish a DOS from desired behavior, there’s something very wrong with your architecture.
edit: added parenthetical
smaudet a year ago

How about no infinite loops (as a start)?
Unless you are the kernel, and you can demonstrate that your loop is "safe" via some set of static analysis.
- torginus a year ago
  
  Infinite loops are not that harmful by themselves if they only spin the CPU. The scheduler can schedule other work on the core if the process' timeslice runs out, or it results in a frozen application if said infinite loop happens on an event loop thread.
  Neither are pleasant, but they don't compromise system integrity and as such are not substantially different from other kinds of crash bugs.
  - smaudet a year ago
    
    > Infinite loops are not that harmful by themselves if they only spin the CPU
    Categorically that's the exception, not the norm.
    For always-on systems, "by themselves" is pretty critical, as they can result in DOS style bugs/attacks as illustrated in the parent article. The infinite loop is most useful e.g. for the scheduler or other event loop code, which by definition this style of code does a lot more than "only" spin the CPU.
    For systems capable of long-term power saving operation - the system can go dormant (either no power draw or very little). An infinite loop can be the difference between weeks of power off a battery, or days/hours.
- banish-m4 a year ago
  
  Achievable with formal verification and soft- and hard-realtime worst case timing validation. It's not impossible but also not easy. It requires significant engineering investment.
- Tsarbomb a year ago
  
  Are you suggesting a general solution to the halting problem?
  - jamesmunns a year ago
    
    The usual solution, including in safety critical systems, is to give the judge of the halting problem a stopwatch and a gun.
    For example in an embedded system: a watchdog timer that you don't service during the execution of some context. If you fail to complete your task within the time, the kernel or entire system is rebooted.
    For example in a VM-like system, you give the code some amount of "fuel" or "budget", if it exceeds that budget, the process/tasklet/whatever is terminated by the VM.
    It's not a general solution to the halting problem, but a practical one.
  - fragmede a year ago
    
    funnily enough, ChatGPT can examine code and can say if it will halt or not, in some small cases.
banish-m4 a year ago

That's shrugging off inferior processes and substandard work. Bugs are unnecessary because there is a finite amount of code and they can all be eliminated if the correct eyeballs spend sufficient time reviewing, testing, and simplifying behavior to focus on robust reliability and correctness. Breaking changes can be allowed in dev packs with semver release notes. There are no excuses for sloppy engineering.