Monday, 21 April 2014

Transition to 64 bit - Pain or Panacea ?

Once in a while the 64 bit moniker raises its head and promises to administer a much needed testosterone boost to whatever is ailing the computing industry. Most recent example is Apple with its IOS 7 release. Yet the facts are far different detached in most cases with this marketing hype created.

64-bit at a programming level comes with the promise of more registers in the CPU, the ability to process larger data (64 bits at a time) than 32 bit and ofcourse an understated/unspecified improvement in the basic CPU architecture (say removing flaws of the 32 bit architecture). To be frank the first two capabilities effect a certain class of math intensive applications like image processing, signal processing, compression/decompression, security algorithms etc. But not everything under the sun can get significant jumps.

Infact most general purpose applications are not written from scratch for 64 bit (as it should be), but are ported. Those applications are optimized for a 32 bit architecture, and when compiled with minimum changes for 64 bit can work on 64 bit but are not tuned for it. It results in no gains whatsoever in many cases, despite the compiler using some of the extra registers and sometimes come with a performance penalty. Surprise ! Surprise !! Surprise !!! 

The reason are that many routines in C/C++ which are not CPU compute intensive cannot use all the extra registers and benefit somewhat little by the extra registers and ofcourse the understated newer & better CPU architecture. To make matters worse the data which was aligned for 32 bit (and multicore cache lines) is sometimes thrown out of alignment for 64 bit which imposes a performance penalty (memsets, memcopy, etc can become heavier for structures). The net result varies and is anybody's guess but definetely not what the 64 bit technology promised in the marketing materials. And as long & pointers (and sometimes ints in some architectures) data are now 64 bit instead of 32, in general the dynamic footprint also increases and this increase is sometimes upto 30%. And sometimes our programmer who ported has not used all the possible static analyzers correctly, or supressed compiler warnings that resulted in some bad type conversions to linger around that crash sometimes when values greater than a 32 bit integer range could hold but work fine when the values are within the 32 bit range !!! So instead of a higher performance you expected,  you end up getting  a 30% code bloat and maybe a 10% drop in performance and top it a few of random bugs (Heisenbugs) leading to a question mark on stability. I have seen scene this play out many a times. 

And if you are using an IOS 7 on iPad or IPhone 5S, you may have observed that after migrating to IOS 7 the low memory errors, random crashes Safari (with multiple tabs), frequent browser tab refreshes and crashes in other apps have increased with the diagnostic data pointing to low memory (which was otherwise absent in 32-bit IOS 6).

My intention is not to blow a  gun on your face and scare you off 64 bit Computing. 64 bit was meant for applications that need to use more than 2 or 3 GB addressable memory space in one process (IPhone/iPad have 1 GB RAM). Those are anyways not very common in general purpose applications. Maybe for large in-memory databases(caches) or scientific computing applications. And a majority of applications don't need 64 bit. So if you are a developer, look before you leap. And for those who can benefit from 64 bit, my advice would be to:

(1) Compile the code with the highest level of compiler warnings and ensure that the warnings don't point to 64 bit problems
(2) Use static PClint, KlocWorks, Coverity, Viva64 etc (better to include a special 64 bit code checker oif available) to detect 64 bit errors and get rid of them
(3) Wherever you are casting pointers to integers and vice-versa, check it carefully.
(4) Expect Code bloat. If its not acceptable you may have to use compilert size optimization and/or hand-tuning (wondering what apple and IOS app developers are going to do)
(5) Measure performance and tune data structures to play nicely on 64 bit and Multi-core cache lines
(6) Test Comprehensively.

I am not aware of a test method which can ensure that all pointers are greater than 32 bit range even if run on a machine with more than 4G physical RAM. So very strict static analysis and code review is your best bet to remove heisenbugs, even though you may be having a very comprehensive test net. 


Are there any low hanging fruits in memory footprint optimization ?

That could be one question asked after my last post. The answer is a very subjective "maybe" and here's why:

First of all the memory footprint optimization would frequently involve a space and time tradeoff. If you use code  transforms to reduce footprint, many transforms will also result in a decrease in speed (or performance). Quite common with data structures.

Second, the overall size optimization problems is actually two independent sub-problems viz. Static (or Code size) footprint optimization and Dynamic (or runtime) footprint optimization. Code size impacts you secondary storage (ROM, Flash or Disk) as well as primary storage (RAM), while runtime data influences your primary storage only. The importance of these are domain dependant. Mobile devices usually have 8/16/32 GB of flash storage (without occasionally a scope of explansion through SD Card by another 64 GB) while desktop/server storage can be in terabytes. Mobile devices have 1/2/3 GB RAM, but that available in desktop or server can range from 4-100 GB. Increasing any storagehas huge financial and battery consumption implications for mobile devices, not always so much for desktops or servers.  In some cases dynamic footprint optimization can help systems which are memory bound, by eliminating the need of extra machines. It is important, however, to remember that footprint optimization is two sub-problems and needs two seperate medicines.

Static Code sizes comprises two parts viz. Program (instructions in code segment) and Static Constant data (in bss & data segments). Usually the code segment is much larger than a data segment, unless you encounter a library which uses huge tables for some type of data mapping or transforms (like iconv).

To optimize data segment, you need to cut the data and sometimes the data is so linear that can be replaced by a mapping function which computes the result rather than look it up directly from the table. This is where my "maybe" comes from. And its very rare. Such replacing of data with computing code is good for size optimization, but negatively impacts speed. Data can also be cut by reducing duplication of constant data such as strings by storing it at one place and using a reference to it everywhere.

The code segment optimization needs more effort and is often the outcome of software bloat by enhancement or careless coding. The most efficient way to deal with it is not to take a hammer and beat out every function in sight, but to take a chisel first and carve out the software suited to the exact use case. The sculptor's toolbox should include the following methods and practices:

(i) Avoid Static linking of reused infrastructure. Consider dynamic linking as it will maintain only one copy of the instruction in memory but different data segments per process. 
(ii) if sharing is within a process space, consider developing the library without any global writable data but thread-specific (or context specific data)
(iii) Use compile/link time exclusions, code removal, etc to ensure that only the needed features are built-in and the rest are excluded from the build. This tends to give the maximum returns on investment.
(iv) After having exhaused all the above, you need to take the hammer. Unlike performance, I have not come across footprint hot-spots in instruction segments at a micro level. Maybe you will find one module/layer is very big but not one function. In such a case, you need to use the hammer on that module and not the rest. Otherwise you have no choice but to beat out everything in sight.
(v) Cutting instructions involve enabling compiler's size optimization flag, removing large loop unrolling, repetitive code, duplication & simplification of error handling, sharing of code between FSM handlers and so on. It will also involve removal of common log strings by having a table and using a pointer to that. The idea is to have less code for the same functionality.
(vi) in some cases, some implementations can compress sttaic code and uncompress it dynamically while loading into memory

In the worst outcome, you may even need to seperate the performance sensitive codebase and the size constrained one, where the latter is written completely differently. This used to be quite a common outcome if one starts by taking a server/desktop side librray and tries to port it to a memory constrained mobile device.

On the other hand, dynamic footprint is usually impacted by your runtime data structures in addition to code segment. if runtime data is the problem, then you can try transforms like structure alignment to natural boundaries, structure packing without alignment to any boundary, decreasing the storage width of elements based on range, opening less file descriptors at any stage, implementing custom memory mangement to avoid small allocations by preserving the data structure design. This may also be at times a low hanging fruit (my "maybe" again). Or you need to design a more compact (even if less efficient speed-wise) data structure. On more example is stack usage. If stack usage is the problem, things like recursion or very deep function calling tree sequence need to be avoided.

When we start any dynamic footprint optimization activity, we must clearly measure how much is the static and how much is the additional runtime footprint and then direct attention to the right place for maximum impact. Attacking impulsively any low-hanging fruit without judging how much value it can generate, is just an inefficient way of working that fails to produce tangible outcome. To summarize:

Static Footprint = Instruction segment + Data Segment
Dynamic Footprint = Static Footprint (its loaded into RAM) + Runtime Data

Usually most of the time the problem is the dynamic data, not just the static footprint in isolation on both server side and client (mobile) software.

Sunday, 20 April 2014

Low Hanging Fruits for Performance Optimization

If you are optimizing code, the biggest gains for you are likely going to come from optimizations in software design (i.e. your data structures and algorithms) and architecture (concerns like distribution, functionality, interface, multi core, etc). But usually when you run into a system which needs to be put on sterioids, this not the first thing you should be looking at. 

Let me begin break down the optimizatin problem into three stages:

First, I have seen many teams working hard to achieve some internal goal on performance of a software sub-system/module, when the software module has very little impact on the overall system performance. For eg., a protocol stack takes just 10% CPU usage in a product sub-system. If we improve the performance of this stack by 100% (in most cases by literally burning the midnight lamp), then the impact on the overall system is hardly 5%, something which is not significant. IMO optimizations like performance modeling have to be partitioned top-down, not bottom-up as software teams might be tempted to do. At times sub-system architected when challenged with a 50% improvment might be tempted to set every module in the critical path an improvment goal of 50% so that he can meet the goal. This is not the optimal way of working as the bottlnecks have to be identified quantitatively and then targets set. The benefits must be tangible for an optimization activity to be started on any peice (or whole) in the first place. 

Second, there is no end to the performance improvment that can be achieved on software. You identify & remove one bortleneck & something else gets significant. You remove that, a third might crop up and it goes one. At some stage you will not see one bottlneck but many equal contributors the conclusion being you need to deal with all of them to go to the next one. But their has to be an end to this practically. IMO, we must leran to accept that no design is final or perfect, but their is something in between that is "Just Good Enough" for that time. Optimization takes efforts. And as you optimize further it gets tougher each time. You have to consider the value (impact) of the optimization paarmeter (throughput, latency, resource usage, etc) on the whole (entire system). At some stage you are likely going to come to the conclusion that further optimization does not yield any *tangible* benefit to the end user or the product. That's when you need to stop.

Third, we assume that we have crossed both these stages, and you are convinced about the need to optimize one step, the question of how arises. As I have already said design & architecture chnages bring big benfits, but involve the maximum risk and efforts. Before you jump to this, you need to check the "hotspots"  in the present software design. Its likely that a few functions (or even routines) in the architecture are consuming a lot of CPU. For eg., an avoiadble socket open and close, an OS System call, a partcular hash lookup, etc. Many of them are low-level coding problems. You could consider if the usage of these hotspots can be reduced (avoidance), or the bottleneck itself can be improved.  A hotspot's contribution is  the product (multiplication) of both the number of times it invoked and the cost of each invocation. Optimize either (or both) and you will observe gains. These changes are localized  at times but the common trait is that they are very simple and less risky to make. That's why I call them the "Low Hanging Fruits".

A time will come when you can't do anything more with the the hotspots or their contribution to the overall system is not dominant. That's when you need to stop the hotspot based optimization and go to the  design or architecture changes (again). Data structure design & algorithm should be the first targets in this step and we can follow if required with architectural modifications.


Wednesday, 16 April 2014

Should we standardize technology or move one step ahead to standardizesoftware infrastructure

I have worked a better part of my career on standards related protocol stacks and middle-ware (SDKs, Frameworks, etc). Yet today the utility of the way the entire standardization activity is carried out can be questioned.  Standards involve the following main stages

(1) Identify a common direction or problem, encountered in the ecosystem.
(2) Multiple players get together to create a forum/working-group within a suitable SDO (that supports the ecosystem)
(3) The forum meets, brainstorms and designs a working solution on paper (i.e. charter, architecture and design). The technology is licensed under non-discriminatory to all players whether they are part f the forum or not.
(4) The ecosystem adopts the above work and everybody goes to the computer to write the supporting software infrastructure, carries out trials, inter-operability tests and the like. The problems are identified, solutions  proposed and this goes back as feedback [step (3)] for correction and enhancement.
(5) At some point the standards mature, the changes reduce and commercial deployments happen

This is how IETF, 3GPP, etc used to work. The entire cycle would take years (sometimes a decade) from start to finish. Very un-agile so as to say. Today we value frequent software releases, early feedback, welcome changes and prefer working software to bulky documentations.

 Now consider the following alternative:

(a) We have a (1) and (2), but instead of a big document (or revision) at the end of every 6 months, the SDO working group releases a simple architecture document and working software infrastructure as open-source to members every month (or even faster) highlighting how to use the new features, defects fixed rather than describe what is the wire-protocol or how it is working internally.
(b) The players immediately integrate (upgrade) the software infrastructure released by SDO in their Prototype products, carry out the tests and give feedback.
(c) Finally at some point the changes reduce and the software matures. At this point a small optimization phase is done on the software infrastructure, changes from multiple players (who are willing to share their innovation) merged into the forum OSS code-base.
(d) Subsequent testing achieves stabilization and now a final detail design document can be produced (as a subject of academic study and knowledge dissemination)  along with the working software.

How would (a)-(d) fare in terms of agility ? I think it would cut the overall development costs for every player, shorten the time to market drastically without sacrificing inter-operability (because everyone uses the  same implementation). In fact its likely that better inter-operability is achieved compared to the former process which leaves out chances of people interpreting (& implementing) some under specified parts of the standard differently (this is a very common problem). At the same time business logic innovations (that are the revenue spinners) are still protected.

Open source supporting such standardization work collaboratively done by leading players will result in a terrific force multiplier for the entire market. I already see the newer SDOs (like SDN forum with OpenFlow) working in a model similar to this by producing an implementation side by side of a document. 

Tuesday, 1 April 2014

Energy Efficiency for C Programmers

In one of my previous posts  titled "How software goes Green", I had propose that a design for resource efficiency leads to energy efficient software. What can we do as programmer ? Extend this to programming.

As with performance engineering, 70-80% of the energy profile of an application can be addressed by good design (better data structures, better algorithms). The remaining 20-30% needs some support from coding. I would warn the readers that code hand tuning is not the most rewarding process, unless for some reason you are hard pressed to squeeze the last bit of energy consumption out of the application. It is useful in thin infrastructure components and platforms, where their is a lot of functionality but the complexity (perhaps indicated by call stack depth) is low. Or some function/routine is so frqeuently used that it becomes a significant contributor to the overall energy consumption of the software.

Tune code for least runtime instruction cost (not size). As a general the fewer the runtime instructions, the more energy efficient the code will be. Unrolling small loops, were the cost of setting up a loop (initialization condition, exit condition and loop progression) is comparable to the cost of the loop body. Similarly using vector instructions will tak emuch less energy as one instruction does the jpb of many. avoid unwanted memsets, memcopy etc.  Also multi-core is the norm today, not exception. If a large computation can be paralleized across cores, its better to do that rather than run one core to its peak power for longer duration.

You can refer to any code performance tuning material for such examples.