• 0 Posts
  • 94 Comments
Joined 1 year ago
cake
Cake day: June 15th, 2023

help-circle



  • ----BEGIN PGP SIGNATURE-----
    iQIzBAEBCgAdFiEETYf5hKIig5JX/jalu9uZGunHyUIFAmaB8YEACgkQu9uZGunH
    yUKi7Q/+OJPzHWfGPtzk53KnMJ3GC8KQGEUCzKkSKmE0ugdI9h1Lj4SkvHpKWECK
    Y1GxNujMPRM/aAS2M97AEbtYolenWzgYmO1wt131/hEG4tk+iYeB2Sfyvngbg5KI
    y4D7mqpcVWYSf6S13vUX8VuyKeTxK6xdkp95E0wPVLfJwx5o5nH0njLXxeW0IblY
    URLonem/yuBrJ6Ny3XX9+sKRKcdI9tOqhMhTxPcQySXcTx1pAG7YE7G5UqTbJxis
    wy7LbYZB5Yy0FO3CtRIkA+cclG4y2RMM9M9buHzXTWCyDuoQao68yEVh4OdqwH1U
    5AUnqdve5SiwygF/vc50Ila6VjJ4hyz1qVQnjqqD96p7CSVzVudLDDZMQZ8WvqLh
    qaFr51xJvH6p6/CP1ji4HHucbJf6BhtSqc8ID9KFfaXxjfZHiUtgsVDYMV0e7u9v
    lhcDH/3kmw/JImX25qsEsBeQyzOJsBvxOYD3lZrwSY9+7KNGVQstFrEvCuVPHr72
    BQJPIhg3+9g6m36+9Uhs1N6b8G9DsZ6OgnNqr9dGturUg6CtRsLSpqoZq0FT9cLA
    tnFTJDaXgx1DZnsLGDSoQQYjZ3vS+YYZ8jG86KGLFyXVK+uSssvorm9YR1/GGOy7
    suaxro72An+MxCczF5TIR9n3gisKvcwa8ZbdoaGd9cigyzWlYg8=
    =EgZm
    ----END PGP SIGNATURE-----
    


  • I think having an A partition and a B partition (I’m assuming that’s how SteamOS works) wouldn’t help in this case. If the A partition downloaded the definition file, crashed and failed to reboot; the bootloader could failover to the B partition - which would then download the definition file, crash and fail to reboot. It would have to keep rolling back to a last known good snapshot until the update got withdrawn.

    You could have an ephemeral set up that wipes /var and /etc and recreates them every boot. I don’t think these EDR tools would like that very much though.




  • How would you prove that no input exists that could crash a piece of code? The potential search space is enormous. Microsoft can’t prevent drivers from accepting external input, so there’s always a risk that something could trigger an undetected error in the code. Microsoft certainly ought to be fuzz testing drivers it certifies but that will only catch low hanging fruit. Unless they can see the source code, it’s hard to determine for sure that there are no memory safety bugs.

    The driver developers are the ones with the source code and should have been using analysis tools to find these kinds of memory safety errors. Or they could have written it in a memory safe language like Rust.



  • It’s a proprietary config file. I think it’s a list of rules to forbid certain behaviours on the system. Presumably it’s downloaded by some userland service, but it has to be parsed by the kernel driver. I think the files get loaded ok but the driver crashes when iterating over an array of pointers. Possibly these are the rules and some have uninitialised pointers but this is speculation based on some kernel dumps on twitter. So the bug probably existed in the kernel driver for quite a while, but they pushed a (somehow) malformed config file that triggered the crash.


  • For this Channel File, yes. I don’t know what the failure rate is - this article mentions 40-70%, but there could well be a lot of variance between different companies’ machines.

    The driver has presumably had this bug for some time, but they’ve never had a channel file trigger it before. I can’t find any good information on how they deploy these channel files other than that they push several changes per day. One would hope these are always run by a diverse set of test machines to validate there’s no impact to functionality but only they know the procedure there. It might vary based on how urgent a mitigation is or how invasive it’ll be - though they could just be winging it. It’d be interesting to find out exactly how this all went down.







  • It’s not that clear cut a problem. There seems to be two elements; the kernel driver had a memory safety bug; and a definitions file was deployed incorrectly, triggering the bug. The kernel driver definitely deserves a lot of scrutiny and static analysis should have told them this bug existed. The live updates are a bit different since this is a real-time response system. If malware starts actively exploiting a software vulnerability, they can’t wait for distribution maintainers to package their mitigation - they have to be deployed ASAP. They certainly should roll-out definitions progressively and monitor for anything anomalous but it has to be quick or the malware could beat them to it.

    This is more a code safety issue than CI/CD strategy. The bug was in the driver all along, but it had never been triggered before so it passed the tests and got rolled out to everyone. Critical code like this ought to be written in memory safe languages like Rust.