September 30, 2004
Why Threat Modeling matters
Today I thought I would drop a post about WHY threat modeling matters. I recently came across an incident in my own software where, quite frankly, a vulnerability existed because the threat model had not been completely thought out. I am not ashamed to admit this was a failing on my part, especially since I led the entire process. I could easily have swept this under the rug since I caught it before the product was released, but there is a good lesson to be learned here, and I would like my fellow secure software architects to see what I learned.
What is funny about this is that it was a compounded failure rooted in the traditional old adage of 'get it to market'. In the spring we launched a pre-release program of our Intrusion Prevention System with great success. The software did exactly what it was intended to do, and aside from UI enhancement issues and a conflict bug with Symantec's antivirus product (unfortunately a COMMON problem in our industry), everything was looking good for an end-of-summer launch.
Routinely I hold what Joel likes to call 'hallway usability tests'. Basically you pull in people who have no clue about your software, let them muck with it, and watch how they deal with it. I try to do this on any major new feature I add, but since this was a new product I ran a complete round of tests like this with three different groups of people.
I started off by getting it in front of a small .NET user group. The reason was simply that this was my first serious attempt at using C# in a standalone app and I wanted to learn about UI design principles as they related to the platform. The testing went ok, and I quickly realized the UI needed an overhaul... which ended up with me converting the application to use Microsoft's new Inductive User Interface.
The second set of tests I did was at a local university. I basically monopolized a class and had students try to cause the application to die. Rewarding the top student with the purchase of his or her next term's books really seemed to drive it home, and I had a lot of fun watching them bust the application. I walked away with some interesting lessons and quickly made the necessary changes. Although I had plenty of failure code paths for tainted data injection, what I didn't account for was data that matched the validation regex but simply didn't exist in the proper form. (ie: a file path that was syntactically valid, but didn't name the file in its canonical form. Although I checked to see if the file existed, what I didn't do was VALIDATE how the system did that check. More about this later.)
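To make that class of bug concrete, here is a toy sketch (Python, NOT the product's code; the regex, paths, and rule are invented for illustration) of how input can satisfy a path-shaped regex and still not name the file in the form a literal rule expects:

```python
import re

# Hypothetical path-shaped regex: drive letter, then anything that isn't an
# obviously illegal path character. Plenty of non-canonical spellings pass it.
PATH_RE = re.compile(r'^[A-Za-z]:\\[^<>|?*\r\n]*$')

canonical = r"c:\windows\system32\cmd.exe"
alias = r"c:\windows\.\system32\..\system32\cmd.exe"  # same file, odd spelling

# Both forms look "valid" to the syntax check...
assert PATH_RE.match(canonical)
assert PATH_RE.match(alias)

# ...but a literal prefix rule only fires on the canonical spelling.
rule = "c:\\windows\\system32\\"
print(canonical.startswith(rule))  # True  -> protected
print(alias.startswith(rule))      # False -> rule silently bypassed
```

The point being: checking "does this file exist?" says nothing about whether the path is spelled in the form your rules will later match against.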
Then I did something stupid. I let feature creep walk in and modified my path-testing code WITHOUT properly updating the threat model. The fix exposed me to a different, more serious vulnerability, which I didn't figure out until after my third set of usability tests.
For my third set of usability tests I invited some colleagues and friends from various industries who were REALLY computer savvy, but were not security administrators. I asked them to beat on the thing silly, and ran a contest for most critical bug found and most bugs found. I also turned the night into a full out LAN party in an effort to get these guys to come out and really get into the nitty gritty of ripping the thing to shreds.
One of the incidents reported didn't seem significant at first; it wasn't until I was reviewing the bugs in the lab that I realized it was so critical it would have me HALT product release until a serious architectural change was made. And you wanna know why? Because I got sloppy and didn't update the threat model when I came across a scenario that I knew exposed the product to greater risk.
Now to be fair, this isn't 'sloppy' as in being negligent. The potential of the original bug occurring was so small that during risk analysis I determined the user would already need high enough privileges to turn off or bypass the safeguards anyway. Basically, you could bypass the entire IPS simply by renaming volumes on the system. ie: if you were protecting the 'c:\windows\' directory and an adversary remapped C: to D: (making the %systemroot% dir 'd:\windows'), the rules would never match and therefore access to the dir was wide open. Such a volume renaming isn't something that commonly occurs... it could possibly break the system if done incorrectly. And only an Administrator could do it... which would mean they already had enough privileges to disable the IPS anyways. But the risk was still there.
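A toy model (Python; the rule table is invented and nothing like the real kernel code) shows why literal path-prefix rules die the moment a volume is renamed:

```python
# Hypothetical rule table: literal lowercase path prefixes under protection.
PROTECTED_PREFIXES = ["c:\\windows\\"]

def is_protected(path):
    """Match a path against the rules the way the flawed design did."""
    p = path.lower()
    return any(p.startswith(prefix) for prefix in PROTECTED_PREFIXES)

# Before the remap, the rule fires as expected.
print(is_protected(r"C:\Windows\system32\drivers\etc\hosts"))  # True

# After an Administrator remaps C: to D:, every path under %systemroot%
# arrives spelled with the new drive letter and sails right past the rules.
print(is_protected(r"D:\Windows\system32\drivers\etc\hosts"))  # False
```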
The point is that I learned about this and decided the best course of action was to write code to watch for volume remounts, and then internally REMAP the paths so the product could handle the situation. What I SHOULD have done was build a proper data flow diagram and then a detailed attack tree to look DEEPER into what the real threat was... issues relating to invalid file name pathing. This would come back to haunt me later. *sigh*
During the third set of tests, one of the testers came across an interesting situation relating to short file names (SFN). Basically he stumbled across a pathing vulnerability in the SFN to long filename (LFN) conversion. It was actually kind of kewl as far as bugs go. If you opened up a target file under protection in notepad, access would properly be DENIED. However, if you opened the same file in wordpad, wordpad would aggressively try to open it 5 times, and after failing would then try to open it using different pathing... including SFN pathing. (You know the naming convention, where MYDOCU~1 represents "My Documents".) Voila... a way to bypass the pathing rules in the IPS. And a situation NEVER accounted for in the labs, since we had NTFS disk configurations set up to ALWAYS guarantee LFN. (I didn't learn about that until doing a post-analysis of the whole situation... another practice you really should follow after finding critical bugs.)
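In toy form (Python; the 8.3 lookup table here is invented, since on a real system that mapping lives in the file system and has to be queried), the bypass and the eventual fix look roughly like this:

```python
# Hypothetical 8.3 short-name to long-name table. In reality this must be
# resolved against the file system, not hard-coded.
SHORT_TO_LONG = {"mydocu~1": "my documents"}

def expand_sfn(path):
    """Expand any 8.3 components to their long-name form before matching."""
    parts = path.lower().split("\\")
    return "\\".join(SHORT_TO_LONG.get(part, part) for part in parts)

RULE = "c:\\my documents\\"

def naive_is_protected(path):
    return path.lower().startswith(RULE)      # what the flawed code did

def fixed_is_protected(path):
    return expand_sfn(path).startswith(RULE)  # expand first, then match

sfn_path = r"c:\mydocu~1\secret.txt"
print(naive_is_protected(sfn_path))  # False -> wordpad's retry slips through
print(fixed_is_protected(sfn_path))  # True  -> access properly denied
```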
After realizing during the third set of tests that a major issue with pathing existed... I went through the process of completely threat modeling that entire area of data entry using different scenarios, and realized that my kernel code simply wasn't capable of properly handling the conversion. And before you start taunting me and saying "pffft.. just use the path conversion APIs" let me be clear... there is no such luxury as the Win32 GetLongPathName() function in the Windows kernel.. you have to roll your own code WITHOUT causing recursive IO. And it wouldn't have dealt with the volume renaming anyways. Actually, there are very FEW luxuries in the kernel... which is why most of us who write kernel code look like we belong in a rubber room, with our hair frayed (if we have any left!) and the 4 o'clock twitch. That's an entirely DIFFERENT story :)
Anyways... the routines to do path matching made no sense, the volume renaming code was clunky, and it was prone to weird issues on foreign file systems. (One pre-release customer had a proprietary encrypted file system that I just couldn't properly get control of.) The threat model exposed the REAL issue, and I did the one thing that my Board of Directors really didn't want me to do. I halted release of the software. I had to. This was a major architectural flaw and the time to deal with it was now... before the release of the product. In the long run this would SAVE us money. And credibility. And trust. And they agreed.
So I spent a month redesigning the entire set of data entry routines that handled pathing. With thanks to Neil from Microsoft and Vladamir from Borland (thanks again guys) I found a way to map the files right down to the device volume, which is guaranteed to always be valid, and I then rewrote the SFN->LFN conversion routines to properly address any sort of conversion issue. In the end, I fixed a potentially larger problem that could have become a real attack vector to bypass the system. A very small chance, but still one I knew about. And you simply CANNOT ship critical software with a known potential vulnerability like this.
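The shape of that final design can be sketched like this (Python toy; the device names and mapping table are invented — in the kernel you would resolve them from the object manager, e.g. via IoVolumeDeviceToDosName, rather than a table). Rules are stored and matched in device-volume form, so drive-letter games stop mattering:

```python
# Hypothetical: both drive letters currently point at the same underlying
# volume, e.g. after an adversary remapped C: to D:.
DRIVE_TO_DEVICE = {
    "c:": "\\device\\harddiskvolume1",
    "d:": "\\device\\harddiskvolume1",
}

def canonicalize(path):
    """Rewrite a drive-letter path into its stable device-volume form."""
    drive, _, rest = path.lower().partition("\\")
    return DRIVE_TO_DEVICE.get(drive, drive) + "\\" + rest

# Rules are stored canonically, against the device volume itself.
RULE = canonicalize(r"c:\windows") + "\\"

def is_protected(path):
    return canonicalize(path).startswith(RULE)

print(is_protected(r"c:\windows\system32\cmd.exe"))  # True
print(is_protected(r"d:\windows\system32\cmd.exe"))  # True -> remap no longer bypasses
```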
When reflecting on this whole experience, besides showing that I am human and prone to mistakes like the next guy, what I learned from this is NEVER cut corners. I knew better. It should never have happened. Doing so ended up causing MORE delays in the product launch which ultimately affects our bottom line. However, what I am PROUD of is that my Board supported my decision to halt the launch. They fully understood and respected my decision that the long term impact of releasing a product that has a serious architectural flaw could very well expose our clients to unnecessary risk which is unacceptable. And something I just won't do.
We also get the long-term benefit of fiscal responsibility in the software design and deployment characteristics of the product; it would have cost us MORE money in the long run in engineering change requests, updates, education etc. if we had launched and then had to totally redesign it while keeping backwards compatibility with what was in the field. It only took me half a day to write an immediate patch/fix for the existing pre-release clients, since it was a small group of sites. Not something you can easily do when you have tens of thousands of installs.
Now, I am not saying that anyone and everyone should HALT an entire company when they find a serious flaw. And I know there is a huge difference between writing the next greatest RSS reader versus something like an intrusion prevention system used to safeguard critical business resources. It may not be practical to halt a release, especially when you have a tonne of installs already; at that point you need to get out there and immediately protect what's in the field. However, it does give me more insight into some of the decisions Microsoft made when halting development to re-educate their developers. And roll out XPSP2. And get out Longhorn.
Although you cannot see it right away, delays may actually be MORE RESPONSIBLE than releasing software at risk. Delays may actually SAVE YOU MORE MONEY in software re-engineering costs. We are not in an engineering discipline where everything can be guaranteed to be 100% safe. People always try to draw the analogy between software and how bridges are built. I don't think it's a fair analogy. Engineers have had CENTURIES to work on that discipline. We are not even 30 years into practical software design. However, that doesn't mean we shouldn't be RESPONSIBLE for our software. Secure software engineering as a discipline may still be in its infancy... but we shouldn't ignore it. Doing so puts everyone at risk.
The threat in our IPS software is now mitigated. The product is again on its release candidate cycle, and I can now look back and reflect on the dumb decisions I made and the impact they had on my company. Nothing too critical; I adapted fast enough to make the right decisions and get the company back on track. And I look to things I have learned from people I respect like Gene Spafford and realize that ultimately I made the right decision. Security is a property supported by design, operation, and monitoring... of correct code. When the code is incorrect, you can't really talk about security. When the code is faulty, it cannot be safe.
Fixing the architectural flaw was the right thing to do. Hopefully our clients will agree. If I had just threat modeled that area when the first set of bugs came in, I would have found this and saved everyone a lot of headaches.
Hopefully you learned something here. I sure did.

Posted by SilverStr at September 30, 2004 02:55 PM