December 07, 2006

Can you really trust what you read on blogs?

Sorry I haven't posted lately. I have been totally wrapped up in a project that has been sucking the marrow out of every bone in my body. I have had to bring myself up to speed on a lot of technology I have been happy to avoid over the last decade, only to have many interesting challenges arise from those decisions over the years. I consider myself a fairly competent software architect... but when I get stuck, man do I get stuck. I put a product we expected to launch last month at least a month behind. *sigh*

I wanted to share a story about a challenge I had for some time, a failure that has taught me a clear lesson. It took me over a month and a half to conquer this problem, which I am sure many of you could have solved quickly by making different choices in the design phase.

My first mistake was that I put blind trust in people who were NOT in my circle of influence. I ASSUMED that they had real world experience in what they spoke of, only to find out after the fact that this wasn't the case. This bit me both on public Microsoft newsgroups and with MSDN blogs, where employees at Microsoft made statements that were found to be OFFICIALLY unsupported, and in a few cases, UNTRUE or not properly tested. That is a problem with blogs; people give their own experiences... and we the readers don't always know where their experience comes from. We just take the words as fact.

In my case I was directed by various Microsoft employee comments in newsgroups and a few blogs that I could write a managed COM component in C# to take advantage of web services and WS-* security, and that I could then use it in unmanaged C++ code. Well... that's not entirely correct. That is true of standalone applications in user mode. But go ahead and try to do it in services like IIS or IAS, where you CANNOT do INPROC COM interop. You run into loader lock issues, which can freeze the system with a massive deadlock, and the scenario isn't supported in the currently released Windows kernels. This is a limitation that is KNOWN to Microsoft but undocumented, and something being addressed in Longhorn. And I only found this out after 4 different PSS cases were opened in the last month, where the support engineers at Microsoft were solving symptoms of the problem, and not the actual root cause. Only when I finally got escalated to a guy on the Redmond campus in charge of this sort of stuff was I clear about the REAL problems, and how to really solve them. I was so pleased to work with him; he made everything clear to me without making me feel uncomfortable. Of course I had to re-architect a huge piece of our system to deal with this issue. But I am not fretting about it; the new design is much more robust... and more reusable for other stuff we are doing anyways.

What did I really learn here though? Well the first thing is that I am NOT a COM guru. I already knew that... but never realized how much I still had to learn about COM and interop issues with managed code. The last month has NOT been fun as I tackled the learning curve. More importantly though... I learned that you cannot blindly trust what people say on blogs or in newsgroups, no matter WHAT domain they belong to. I trusted that *@microsoft.com employee comments on newsgroups and MSDN blogs were golden... I should have known better than that. That is not to say these people aren't smart; it's just that you have to understand that what COULD work in theory doesn't always work in practice. The "Patterns and Practices" people have that down cold. They are the people I should have gone to in the first place.

For those that are interested in what I ended up doing, here was the solution to my problem:

  1. Write an unmanaged COM DLL (C++) that services like IIS and IAS can call directly INPROC.
  2. Have that DLL communicate via DCOM RPC (LOCAL) to a Windows service, ALSO written in unmanaged code (C++).
  3. Have that Windows service load up the managed COM component (C#) via INPROC, where it is safe to load the CLR.
  4. Voila. You can now call into any managed code without loader lock issues.
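
The steps above boil down to one decision: which COM class context to ask for from which process. Here is a minimal sketch of that decision. It is not the code from this project. The `Host` and `Component` enums, `activation_context()`, `connect_to_broker()` and the broker CLSID parameter are hypothetical names made up for illustration; only `CoCreateInstance`, `IID_PPV_ARGS` and the `CLSCTX_*` values come from the Windows SDK. The Windows-only part is guarded so the rest compiles anywhere.

```cpp
#include <cassert>
#include <cstdint>

#ifdef _WIN32
#include <windows.h>
#include <objbase.h>  // CoCreateInstance; link with ole32.lib
#endif

// Numeric values mirror the CLSCTX enumeration in the Windows SDK.
constexpr std::uint32_t kClsctxInprocServer = 0x1;  // CLSCTX_INPROC_SERVER
constexpr std::uint32_t kClsctxLocalServer  = 0x4;  // CLSCTX_LOCAL_SERVER

// Hypothetical labels for this sketch; they are not Windows types.
enum class Host { OwnProcess, SharedSystemService };  // your own service vs. IIS/IAS
enum class Component { Native, Managed };             // unmanaged C++ vs. C# via interop

// Pick a class context so the CLR is never loaded into a shared system
// service: managed components get pushed out to a separate local process.
constexpr std::uint32_t activation_context(Host host, Component component) {
    return (host == Host::SharedSystemService && component == Component::Managed)
               ? kClsctxLocalServer
               : kClsctxInprocServer;
}

#ifdef _WIN32
static_assert(kClsctxInprocServer == static_cast<std::uint32_t>(CLSCTX_INPROC_SERVER),
              "sketch constant must match the SDK");
static_assert(kClsctxLocalServer == static_cast<std::uint32_t>(CLSCTX_LOCAL_SERVER),
              "sketch constant must match the SDK");

// Step 2 as seen from the unmanaged shim DLL: ask COM for the broker
// object hosted in a separate process on the same machine.
// `broker_clsid` is a hypothetical CLSID registered for the broker service.
inline HRESULT connect_to_broker(REFCLSID broker_clsid, IUnknown** broker) {
    return CoCreateInstance(broker_clsid, nullptr, CLSCTX_LOCAL_SERVER,
                            IID_PPV_ARGS(broker));
}
#endif
```

With something like this, the DLL that IIS or IAS loads never asks for the managed component INPROC; only the separate Windows service does, and that is a process where you control when the CLR gets loaded.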

Ya, pretty simple looking back. But when you are told by various sources that you can call the managed COM component directly, you just don't think you need all the middle tier comms. Wrong.

Lesson learned. Time to load up Bloglines and go catch up on all those blog posts I haven't had time to read lately. *chuckle*

Posted by SilverStr at December 7, 2006 10:26 PM
Comments

Uh, you scared me with the IIS and managed COM server stuff. I am porting our native Delphi SMTP protocol event sink implementation to C# to overcome some 64-bit issues (Delphi does not compile 64-bit) and so far I have not experienced deadlocks on load, but after all, IIS SMTP Service is part of IIS, so this may apply to my case as well. What do you think?

Posted by: pb at December 7, 2006 11:55 PM

Painful. It reminds me of the statement: "the domain is the security boundary". I wonder how many companies designed their networks based on that...

Posted by: Patrick Ogenstad at December 8, 2006 12:58 AM

I'd like to get a boiled down version of this problem to reproduce - since you're not in my circle of influence ;).

Can you go into a little more detail about how to reproduce the issue? Does the problem happen only under load? How many managed COM objects are you creating? Do the IIS settings matter?

I'd love to build an uber-simple version of this problem, and throw some load at it to get some well defined guidelines.

Posted by: Joe Rustad at December 8, 2006 08:13 AM

Hey Joe,

Sure thing. Write an IAS extension DLL. Have it do an INPROC_SERVER COM call to a managed COM DLL written in C#. No load is needed. When you call into the DLL, it locks up the service, and everything else in that service (svchost.exe is NOT a forgiving sort of fellow).

The reason for the problem is that when you call CoCreateInstance() you are trying to load the CLR into the same memory space as IAS, which isn't allowed. Basically, you cannot guarantee what thread is doing what within the service, yet the p/invoke calls are based on thread affinity. Although it IS possible to do this, it can cause a lot of unknown and undocumented problems. That is why it isn't supported. And as usual... with my luck, I tripped over what happens when a problem DOES occur.

Posted by: Dana Epp at December 8, 2006 10:38 AM

Don't beat yourself up over listening to someone(s) that happened to be wrong. It can happen even in high-brow professional expert circles. People can be wrong.

But you could warn all those people trusting Wikipedia too. ;)

Posted by: LonerVamp at December 8, 2006 12:05 PM

Oh!! That is really a long procedure to get to a successful path! I thought you would get it resolved quickly when I transferred it, but it seems you've moved all the way from India to Redmond!!

I certainly understand the pain you must have gone through while redesigning your whole architecture. But great to know that your problem is solved now!

And, thanks for sharing the info. It will surely help other people.

Posted by: Jigar Mehta at December 8, 2006 04:01 PM