fmII
Fri, Jul 25th home | browse | articles | contact | chat | submit | faq | newsletter | about | stats | scoop 13:57 UTC
in
Section
login «
register «
recover password «
[Article] add comment [Article]

 Optimization
 by Erik Greenwald, in Editorials - Sat, Oct 7th 2000 23:59 UTC

If you want to write really fast code, what should you do? Hone your assembly skills? Erik Greenwald thinks that's a mistake too many people make, and that it's possible to write even better-optimized code in a high-level language.


Copyright notice: All reader-contributed material on freshmeat.net is the property and responsibility of its author; for reprint rights, please contact the author directly.

Introduction

Many people seem to think that the best road to blindingly fast programs is writing as much as possible in assembly. Typically, this results in buggy software that is actually slower than that written in a higher level language. Premature optimizations can cost huge amounts of time and effort. No optimization should occur before the program works without crashing or exhibiting weird behavior.

Make it Work.

Attempting to optimize a program that doesn't even run is a pointless task. If your program crashes, then not only is it impossible to profile (binaries compiled for profiling write the data file after your program calls exit), but the users won't want to run it. Make sure you have no syntactic or memory errors in your program before even attempting to optimize it. I strongly suggest using all facilities in the language to extend warnings and errors, as well as using tools like Splint and Electric Fence to verify your program will not crash.

Make it Right.

Optimization can often make a program less readable, so you want to guarantee that the program does exactly what it should BEFORE you start optimizing. Any logical bugs should be tracked down and squished. I suggest a lot of test cases, especially at the bounds. "Off by one" errors can be tricky to find in clear code, and virtually impossible in code that has been optimized beyond readability.

Make it Fast.

Once we confirm that the program runs free of both syntactic and logical bugs, we're ready to begin the optimization. Often people will attempt to optimize prematurely or in the wrong place, resulting in wasted effort. To make optimizing easy and truly beneficial, a certain order of optimizations should take place.

Profiling

The first step in the optimization process is to locate and isolate the performance bottlenecks. Amdahl's Law states that the greatest opportunity for improvement is where the most time is spent. If a function is using 50% of the program's processor time and the time spent in that function is halved, the program will execute 25% faster. If we alter another function so it runs eight times as fast, but it only accounts for 10% of the CPU utilization, then we increase the speed of the program less than 2%. Obviously, the key to a good optimization is to improve the parts that consume the most time.

There are several tools available to do profiling, but I will concentrate on GNU gprof. It's a free tool that works with GCC and produces the amount of time spent inside of each function. Most libraries can be compiled with profiling support, so you can even use it to see if a library function is the bottleneck of your application. The gprof manual is available at http://www.gnu.org/manual/gprof/, and the software package is available at the GNU FTP site (and is probably already installed on your system if you use Linux). To get started using gprof, you only need to know how to compile your program to generate profiling information and how to view the generated information. GCC has a flag to enable profiling information generation, and the "gprof" command is used to view it. Note that gprof has to be run in the same directory you ran the program in, and requires the program as an argument.

$ gcc -o myapp myapp.c -pg
$ ./myapp
$ gprof ./myapp

Algorithmic Optimization

Once we know where the time is being consumed, we can start trimming down the time consumption. When you find a function or set of functions that consume a large amount of time, the first question you should ask yourself is "Can I approach this problem in a better, smarter way?" Big-O notation is often used to describe the efficiency of an algorithm, and is useful in this step. When I'm exploring the difference between different algorithms, I like to use a higher level language than what I'm writing the finished code in. This allows me to experiment with some radical changes to the algorithms with very little time and effort invested to explore those avenues.

Language Usage

While the algorithm itself may be sound, the implementation isn't always. If you are satisfied with the algorithm, you should look for how you expressed that algorithm in the programming language to see if there is something you could have done better. The biggest improvement is usually to cache computations instead of computing them several times. Functions that are called frequently may benefit from being inlined. A good knowledge of how the language is implemented can be very useful here.

Compiler Options

Just about every modern compiler has some fairly hefty optimization capabilities these days. Finding the right combination of flags to suit your code may take a little experimentation and reading. The GCC man page describes what the different flags do in a fair amount of detail. The first flags to consider are the "-O" flags. "-O" by itself does light optimization, and is equivalent to "-O1". To disable optimizations, you can use "-O0", and to set more aggressive optimization levels, you can use "-O2" or "-O3". Setting the machine type is also useful, as the compiler can build the assembly to use extended opcodes and registers. Most of us have x86 assemblers, so some of the flags useful for that architecture are "-m486", "-mpentium", and "-mppro", but be careful -- if you set your machine type too high, earlier machines may not be able to execute the code. The last series of flags you should consider are the specific optimization enables. To use the fast math library, for example, you can add -ffast-math to the compile string, but the O flags often incorporate several of these and are probably going to generate better code than if you try to set the "-f" flags yourself. If you know your program would benefit from a flag and the optimization level doesn't add that flag for you, you can force it.

Compiler Hints

Modern compilers do a very good job of figuring out how to convert your code into efficient assembly code. While the #pragma precompiler directives and certain keywords like "register" are still part of the language, they should generally be considered deprecated.

Assembly

When it comes to optimizing, the kneejerk reaction of a lot of people is to write in assembly. This thinking is very wrong. The compiler can almost always generate better assembly than a mere mortal. If you feel the need to write a piece in assembly, you should first examine the assembly that the compiler generates. To make gcc generate an assembly source file instead of an object or binary executable, simply call it with the "-S" flag. Examine the generated assembly file, and if you are absolutely sure that you can build better assembly than gcc, knock yourself out. I strongly suggest doing the assembly as inline assemble in the source and wrapping it and the original version in a precompiler conditional, so the program can still compile on different architectures.

Conclusion

Given this order of optimizing, you can greatly reduce the time and effort that goes into making quality software that operates in an expedient matter. I've only mentioned a few tools to help facilitate development; there are plenty of others out there (some do slightly different things, and some target different compilers/platforms). Happy optimizing.


Erik Greenwald (http://math.smsu.edu/~br0ke/) is a fourth year computer science student at Southwest Missouri State University. A Linux addict since 1996, he is active in Free Software development. He can be reached at erik@smluc.org.


T-Shirts and Fame!

We're eager to find people interested in writing editorials on software-related topics. We're flexible on length, style, and topic, so long as you know what you're talking about and back up your opinions with facts. Anyone who writes an editorial gets a freshmeat t-shirt from ThinkGeek in addition to 15 minutes of fame. If you think you'd like to try your hand at it, let jeff.covey@freshmeat.net know what you'd like to write about.

[Comments are disabled]

 Referenced categories

Programming Language :: Assembly

 Referenced projects

binutils - Provides programs to assemble and manipulate binary and object files.
Electric Fence - A malloc() buffer-overrun debugger that uses the VM hardware.
gcc - The GNU Compiler Collection
Splint - A tool that checks C programs for security problems and coding mistakes.

 Comments

[»] assembler is really the higher language!
by frizzz - Mar 16th 2005 22:53:18

I miss in this discussion some sentences about adressing. "high-level"-programmers seem to think, that the meaning of a value is built in the value - but the meaning is reasoned by adressing the value.
Every language, which omits adressing modes is therefore lower than assembler! It is a subset of the possibilities given by a CPU. The vastest lack can be seen in omitting indexed adressing in higher languages.
In the newest releases of gcc, there is the feature provided to define a labels adress at runtime - working similar to the assembler mnemonic 'lea' (load effective adress), but I could not find out how to deal with this adress in a 'goto'-statement!
Opposite to this, it is very simple using assembler to switch branching in only one command to thousands of branches! An example could make it clear:
adress_of_label: DD label ; = a "pointer"
...... jmp [adress_of_label] ; indexed jump to label
As You can redefine the value of "adress_of_label" in every part of Your program, You can branch without a case-selection depending on status-values before You jump. You can find a lot of examples in the assembler OS, which I just published at http://www.rcfriz.de
This OS is made of only 30K code, but exceeds the well known M$DOS in every respect - only the first step to a mature kernel...
There a possibilities in using pointers to adress values in structs - but if You know, how mighty a
mov eax,[es:eax+ebx*4+displacement]
is as part of a loop, You will laugh about those ***,who like to say "C is higher abstracted..." - again: it's lower abstracted!
But there are other disadvantages, when those higher level programs are linked. You need not only more than one (higher) language for that purpose. You need a crop of bureaucracy!
I'm sorry - english is not my mothertongue. That's why I want to end with this...

--

[reply] [top]


[»] Texts/common hangups.
by Spider - Oct 15th 2000 02:52:15

okay, now for some real issue of interest if anyone is still active here:
1) Resources for info, PLEASE!
2) portability:
#ifdef __386__
asm
mov ....
end
#else
C code
#end

That is a really good thing, aye..

Now, for a quesiton/request..
is there anyone who has a tool/toolinprogress to search for repeated structures of code within a program, or several?

that is, not to search for the -same- structure, but to search for alike ones...

for x in 0 to 100 do
for y in 0 to 1000 do
for x2 in x-1 to x+1 do
for y2 in y-1 to y+1 do
colour(x,y)+=colour(x2,y2)
end end
colour(x,y)/=3^3
end end

or whatever..
run it on source, find where 1) the programmer 2) the compiler
is repeating code wich can be reused..
and also, code wich can be fast changed into some more optimized roll of assembler.
there are always certain things that are used more often in programs as assembler.. ideas/thoughts/andsoon?

--
Socially Raped

[reply] [top]


[»] ASSEMBLER
by - GALAXY - - Oct 9th 2000 11:22:18

i have written many assembler programs for many machines, like c64, snes, amiga, palm, windows
ans so on. and i must say, the best structured and best optimized programes can only been done
with assembler. its sadly true that the portability of programs for different plattforms is an disease
or nearly impossible,m also learning the new registers, code and whatever is hard to. but its still
true and even will ever be, that an ARTISTIC assembler programmer that knows how to save
tact cycles will ever create a faster code than any c fuckcompiler will do..

..to be honest programming these days, or whatever ppl call these java things went curious these
days.. so who cares..

[reply] [top]


[»] Optimization
by Zygo Blaxell - Oct 7th 2000 23:26:47

A few comments on this article, before I throw it away completely and provide a replacement. ;-)

The Intel Optimization Handbook (manual? appendix? docu-thingy? After you read the entire text of the thing, you'll forget its title too ;-) is perfect reading for those long black winter nights. Just make sure you haven't eaten first.

Optimization on real CPU architectures is a relatively simple task: you simply slot data into registers until you run out of program, and you arrange to use the parallel execution units in a relatively straightforward way. The bulk of the optimization time is spent analyzing the code's data flow structure and selecting from so many orthogonal optimization possibilities. Optimizing on RISC is a relatively straightforward operation, but it requires doing a lot of tedious counting that humans are not so good at. Often on RISC architectures the opcodes have concurrency issues with other opcodes, which humans are also not so good at.

Contrast with the IA32 architecture, where most of the task of "optimization" is recoding the program in such a way that it's semantically equivalent, but avoids all the hardware bottlenecks. After you've finished performing relatively expensive optimizations on IA32, you get performance roughly equivalent to what you would get from a RISC architecture before you tried doing optimization at all--and worse, you've already allocated all of the chip's actual resources, so you can't optimize any further. The bulk of the optimization time is spent searching a cookbook of pattern-matching algorithms to find idioms that are slow on IA32 hardware and transparently replacing them with idioms that are fast(er). Humans are good at IA32 optimization simply because they have more imagination than most compilers.

Modern IA32 has a lot of design trade-offs. 8 and 32-bit register operations are twice as fast as 16-bit ones, because Pentiums can execute two 8/32-bit opcodes per clock cycle but only one 16-bit opcode. The parallelization rhythm (the exact arrangement of opcodes to fill up the instruction prefetching pipeline so that you can execute instructions in parallel) within a Pentium I/II/III is different for each CPU generation. REAL CPU's have floating-point registers (not a stack), extension instruction sets that are not mutually exclusive (nor cost 100+ cycles to switch from one to another), and branch prediction that actually works right the first time. IA32 has weird opcodes from the 8086 era that do useful things in exceptional cases.


Another thing to consider, one which I haven't seen mentioned here yet, is that modern CPU's have a huge gap between the nominal clock rate of the CPU and the performance of the actual system. Sure, it says 1GHz on the box, but actually only 0.1% of the memory of the machine really runs at that clock rate--much less, if you count disk space as memory (and with applications that swap or use databases, disk space does count as memory). The other 99% of the RAM runs almost 10 times slower--even slower than that if there's evil RAM tricks like fast-page-mode going on. This is why some programs crawl at a snail's pace on Celerons while they fly on Pentiums or AMD chips: "it's the cache, stupid..."

The sad fact is, no amount of assembly code hand-optimization can compensate for even a handful of L2 cache misses. Your program will lose dozens of CPU cycles for every cache miss, and you won't get them back without redesign at or above the C language level. Conversely, you can get them back by redesigning your data flow, and you don't need assembly language to do that. Also, optimizations of this kind really are portable, at least to other CPU's where the data is the same size and shape, as it were.

My own approach to optimization is the same as my approach to debugging and security: don't put bugs in the program to begin with, don't introduce security holes into the program to begin with, and don't make the program slow to begin with.

Optimization, like debugging and security, begins at the design stage, and carries through until release (and even beyond, if the program is actively maintained after release).

0. Be aware of what the goals of your program really are. If your program has to run in some amount of time between one and ten hours and produce results read by millions of people, it probably doesn't need a lot of optimization, but it might need to be debugged and secure. If it has to run once every 33 milliseconds, but has limited communication with the outside world, it might need more optimization than security.

1. Be aware of the cost of correcting detected mistakes versus undetected ones. Don't write code to specifications that are provably wrong on paper. Don't integrate code into the program without testing it before and after integration.

2. Be aware of the cost of correcting security flaws before and after they become enshrined in working code. Have a security policy in mind when you input data, and carry it through until you're finished with it. Do this for all programs, even if the policy you decide on is "you will run this on a non-networked machine using data from trusted users only."

3. Be aware of the architecture of the CPU, the architecture of the compiler, and the relative costs of your data structures and algorithms. Not all memory runs at the same speed, so size can be as important as speed within a critical section of code. Know the memory structure (L1/L2 cache size) and relative speeds of the target architecture, and keep the speed-critical loops within those sizes. Don't pick algorithms that have O(N^x) running time when algorithms with O(N^(x-1)) running time are available, assuming the cost of implementing both algorithms is equal. Don't optimize a loop to save seven clock cycles per iteration if there's an L2 cache miss at the end--the latency cost for accessing main memory is so high that you'll never notice the difference. Similarly, don't design a data structure that causes unnecessary L2 cache misses--or even L1 misses, if you can avoid them. If you have multiple CPU's with threads, don't use a design that forces all of them to wait on a single lock. And so on.

On the other hand, if you know you're going to take an L2 cache hit, and you can't do anything about it, you can afford to spend a cycle or two doing something else, like run-time correctness insertions, or interpreting the byte-codes of your program.

[reply] [top]


[»] register keyword
by Jay F Kominek - Oct 7th 2000 22:03:58

I'd like to point out that very few, if any compilers use perfect register allocation algorithms. Your compiler does not understand your algorithm, and cannot. I have personally seen cases where specifying which variables to keep in registers improved speed drastically.

(There are a number of other flaws with this post which I'm not going to take the time to itemize. Go read your compiler's source code and see if it looks like it knows more about your CPU than you. If so, don't waste your time writing assembly. If you do know more than it, then you might as well spend as much time as you want on it, because it doesn't matter to the rest of the world if you waste your time.)

--
- Jay Kominek

[reply] [top]


[»] optimization article and comments so far...
by ajs - Oct 7th 2000 18:47:59

I find it very interesting that the freshmeat community would think much about optimization. In this community, I'd think portability would be much more important.

I worry about optimization all the time -- I write DSP and embedded code for a living, most recently video codecs -- so let me share a few thoughts. First, always use basic OO concepts, even (and maybe especially) if you're writing in C. Design an appropriate data structure and functions to operate on it. Avoid globals! Your code will stay small and tight and reentrant. And if you're writing in C++, avoid a lot of the non-portable aspects of the language, even if you think it should make your code faster. See the mozilla c++ portablity guide for details.

For example, a lot of times, image data needs to be accessed by row or by column, by region-of-interest, by even/odd field, or by 8x8 block. In these cases, I create a data structure that takes a pointer to the base of the image data, and length and stride values for both dimensions. This basic data structure and its methosds can point to the whole image, a field or a block, so you don't drown in memcpy() when you work with pieces of the image.

Also, write good Makefiles so you can build hyper-optimized, optimized with profiling, or debug, and with different compilers, without rewriting your makefile each time. If you're building for Linux, or using gcc and you're just doing -O2, try this instead:

-O3 -funroll-loops -fomit-frame-pointer -fno-strength-reduce -ffast-math

I don't know if the Pentium-Optimized GCC (pgcc ) effort is still ongoing, but these guys were trying to make gcc much faster for x86. Hopefully their efforts have been folded in with the main gcc development.

If you're not working in Linux, try the machine's native compiler if it's available. I had a huge speedup on an application running on an SGI Octane, just by going to the SGI compiler instead of gcc. (I love gcc., no flames please!) I

If you're in Win32-hell and can spare a few bucks, buy the VTune package. This includes Intel's compiler which smokes Microsoft's (big surprise...) and also produces 10-20% faster code than Cygwin gcc in my experience. VTune is a profiler that can give very detailed reports (down to individual lines of code, which can be viewed c/asm interleaved.)

When you start an optimization effort, go for the easy stuff first. The easiest thing is to inline functions. If one function is being called 10-100x more than all the others, inline it. Also look there for variables touched in the inner loop. Sometimes declaring one of these as register, or removing a repeated intermediate-result calculation, can have a huge effect. Even simple division or intermediate-result calculation in floating point can be more costly than you think, if it's inside a nested loop!

Finally, if you find you need to go to assembly, always #ifdef that portable C version you wrote, so the poor sot who inherits your code and has to port to another architecture has something to start from! And it usually helps to start your optimization effort with the compiler's assembly output. Some compilers are better than others for this.

-- Alan

[reply] [top]


[»] Here we go again
by Anton Aylward - Oct 7th 2000 18:43:05

I remember arguing this over 20 years ago when compilers were less sophisiticated and assembly languages more regular. That same old chestnut about justifying assembly for the inner loop was used as a last ditch defense then as well.
The best argument I've ever heard against this is that processors are getting faster. Sadly the argument against that is programmers are writing code to soak up that extra power - sort of a Parkinson's Law. Look at it this way. In 1982 I was administering a PDP-11/45 with 4Meg or semicondutor RAM and a twin 300Meg disk drives that was being used as a development platform for 40 people. My 350MHz IBM Pentium desktop workstation at 'work' with 96Meg and 8Gig running W/95 crawls by comparison.

There are a few lessons here. The first is 'don't write
unnnessary code'.

But the lesson I'd like you to remember is the same one we preached back in the 70s, and it still applies here. Yes, you may be able to write better assembly code than the compiler can produce, but can you do it consitently, day in and day out, in a clear, well documented, logical, easy to maintain form that will be understood by others years from now?

Personally, I've come to see compilers make great strides. What you should be doing now is looking at the compiler as an expert system for assembly code production. It has the distilled wisdom of many very smart people. This approach means that even mediocre programmers can be effective, and we shure as hell need more programmers! It means that good and excellent programmers can be much more productive. The compiler is in effect an amplifer. It amplifies the amount of code, it peforms all manner of checks (if you let it, if you pick a suitable language). All this is well documented. The last time I fought this battle I got fired; by even the most conservative metric, using the Software Tools package and GCC I was at elast 80 times as effective, that is I produced code that was fully tested, packaged, documented.... as teh assembly programers on the project. And when the tried shaking me off by changing the data structures, I was back on track in under 20 minutes.

The trouble was they weren't paying me 80 times as much as the assembly programmers. Which is one reason I don't work as a programmer any more. There are easier ways to make lots of money than coding all the hours God sends.

If you want to re-fight this old war, then fine, but at least don't pretend it hasn't been fought before and that the assembly side did not come out ahead. Research history and the territory before you try fighting this over again.

Business wants to get product out the door. HLLs are amplifiers and expert systems for code production. Sorry guys, you can beleive whatever you want to believe, but there is a basic economic bottom line.

[reply] [top]


[»] University Teachings
by Adam Bellaire - Oct 7th 2000 18:16:44

This is a dangerous pitfall that is still being more or less 'taught' in universities. I know because they tried to teach it to me. The fact is, when I started studying here, the first time Assembly came up in my Computer Organization course it was heavily emphasized that assembly was used for its efficiency, but this course in assembly was taught BEFORE the course in Software Engineering which taught about the emphasis that must be given to correctness first. Different professors give different signals. The intro professor made sure to make a point that C compilers are vastly more efficient than most programmers, which makes sense since it was an intro course and most of the people in it probably didn't have the knowledge of asm to use it more effectively than the compiler. But there really wasn't much emphases on this 'correctness first' issue, and the workability and the 'software crisis' didn't come up until my junior year!

I'm just glad to see an article about this, because I feel it gave a very good overview in just a couple of pages of a process that's extremely important and that I feel my education, despite its expense, failed to address adequately.

[reply] [top]


[»] Optimization
by Matthias M Giwer - Oct 7th 2000 17:02:21

YMM(Will)V

I am at a loss as to why such an article still needs be written. Make it work first; be clever later. There is very little market for the fast routines that don't work. Schedules call for delivery of working products.

Besides, it might be good enough the first time around and you get the reputation of a miracle worker. Fast enough may be good enough. The "need for speed" is for bragging rights not necessarily for the real world. Better really is the enemy of good enough.

Despite the cynicism of some in the business, it is much easier to sell improvements than explain failures. And in this high stress business having a working product ahead of schedule reduces the Tagamet budget.

Of course it is always good to have a management name for it. "Proof of concept," sounds good. I'm sure Dogbert can come up with a better one.

The point is well taken regarding today's compilers, although it does depend upon the skill of those who write compilers, optimum code for an 8086 is not expected to be optimum for a PII - MMX comes to mind. The suggestion to look at the compiler's machine code first, even if only as a cheat sheet, should be mandatory. Compilers can't be optimal for every procedure in every context but that is likely the last place for the problem to be.

It is not a matter of defending or deriding ASM but rather leave it as a last resort only if necessary.

Anyone who has gone the route of making it work first can tell some great stories. My best example(from 14 years ago) is 18 minutes for the first brute force version that worked correctly down to 45 seconds with only two improvements. But then back up to 2 minutes when I "improved" it again. If I had put all three improvements in the first version ... you can see where that leads. And those improvements were in the TP not assembly. It was all in the parts unique to the program not in things like a faster sort.

Often it is best to use known tools wherever possible. The above program took maybe fifty hours in TPascal and needed programs to write program code and bloated up to about 8000 lines. Just a few months ago I duplicated the unimproved version in a couple hours with an 18 line shell script calling standard 'nix tools.

Considering the speed difference (8 v 333MHz) my test file back then was Patrick Henry's Liberty or Death speech which gave those times. This version does the entire novel 1984 in 40 some seconds with nothing but the brute force approach with no improvements.

Speaking of hardware, consider the total cost of a programmer per hour (roughly double the pay) and the hours it will take to write and test any code improvements including ASM. Then see how that compares to the cost of a faster machine in terms of the desired speed improvment. That is significantly less than twenty hours labor for a 666 machine. (No one tell Jerry Falwell the real speed.)

Again a few months ago, I knew how to render a triangle mesh in POVray even though the resulting mesh was an 11Meg file. It worked (and managed to crash KDE graphics terminals so it had to be rendered from a real console.) Then I dug into the TARGA format and learned how to create one which renders much faster and is under 200k file size. If I had gone direct to .tga format would the bugs be in the program that generated the file or in my understanding of the file format? or both? And I stuck with what I know, TPascal (now released for linux and others as Free Pascal), rather than trying one of the numerous linux languages I now have for free.

I was convinced back then KISS was best and nothing in the intervening years has even come close to changing my opinion. It has never failed me. As I think Heinlein said, all progress is due to the lazy man.

No program is ever bug free until all of the people whose lives the last user affected have joined him in eternity.

[reply] [top]


[»] possible misinterpretations
by Erik Greenwald - Oct 7th 2000 16:52:26

I never said to never use assembly. I've done assembly on a few different platforms (even the crude mnemonic coding done on commadore 64's via a monitor). I see a lot of people who feel re-writing in assembly is the best first approach. I've seen projects where people brag it will be fast because they're doing it almost entirely in assembly. That's the kneejerk reaction I was referring to :)

Inline assembly: If you write any assembly at all, you are killing portability. Period. Inline assembly gives you the chance to keep it all in one place and do some preprocessor magic so a portable (possibly) alternative is available. #ifdef __X86_GAS ...

High level languages: Most people develop in C/C++, so I used that as the reference high level lang. I have some opengl code in scheme I wrote to experiment with some algorithms, it is very sufficient. But most people don't write in scheme, so *shrug* :)

'Big picture': If you either explicitely inline the functions, or tell the compiler to inline what it can, then it will see 'the big picture'. Vital parts may benefit from recoding in assembly, but first you need to see where those vital parts are, second you need to try to call those vital parts as little as possible.

Kneejerk reaction: Portability is a key issue. Some of my friends who use non-linux unices like to rant about how non-portable linux code is. Yet more of my friends use non-x86 hardware with linux, and are at the mercy of programmers to do the right thing in respect to endian and assembly issues. Plenty of asm code is in asm because that's the right thing to do. Thinking asm is an instant speed cure is wrong. The assembly section mentions seeing exactly what gcc is doing, then seeing how to do it better. Rewriting something in assembly that doesn't offer a real speedup is pointless and wasteful. :) Amdahl's law and Occam's Razor.

Data structures: I should have made a mention of finding better data structures in the section about algorithmic optimization. Thanks for pointing that out :)

If anyone feels I said that assembly at all is wrong, I apologize. It wasn't my intent to insult the language nor the people who use the languge. Assembly has its place, and is required for certain things. Part of my motivation in writing this was to counter the people who believe that assembly is the first avenue to good optimization. There are a lot of steps to take before using assembly that can result in much better improvements to speed and/or size.


-Erik

[reply] [top]


[»] Optimization
by ehlarson - Oct 7th 2000 15:03:31

Interesting article, but I think it misses the most important form of optimization of all - data representation. Often the way that you represent the problem in data structures determines the algorithms that are available to you, and thus how fast the problem can be solved. There are a number of examples of this described in books like 'Programming Pearls' and 'The Mythical Man Month'. My favorite is perhaps a bit esoteric, but a good example nonetheless - use of the method of orthogonal collocation to collapse a partial differntial equation to a set of ordinary differential equations. This nifty mathematical trick alows you to reduce the dimensionality of a problem by one, or reduce the big O by one - i.e. speed up the solution by a factor of 10. Finding a clever data representation is often the lightning bolt that shatters the oak tree.

[reply] [top]


[»] One of my favorite tutorial about optimisation
by Eg0r - Oct 7th 2000 14:19:23

From Imaging Technology. Get the first part here and the second part here.

Okay, the journal concentrates on image processing, but it's no reason to skip these excellent tutorial.
While the first part goes quite in depth on C optimisation, with tricks like using pointers instead of increment variables, loop unrolling and optimisation in assembly (the bad way at first), the second part explains how MMX can be used to accelerate an assembly written version of the algorithm.

Well written and definitely worth a read... Any other good tutorials on the internet?

[reply] [top]


[»] C contra ASM
by Spider - Oct 7th 2000 12:47:28

Now this is really something to think through first, yes.
C -> asm is -not- probably better than handcoded asm.
to give a quick 'n' dirty example

function putpixl
{
blagh blagh made in C
}
function getpixl {
blagh blagh made in c
}

main big loop {
for x is 0 to 640
for y is 0 to 480
putpixl ( getpixl + getpixl + getpixl + getpixl /4)
;
;
once that piece of code is working, optimize! asm is much easier to optimize with some experience in the matter. and very few compilers manage to recode several function calls into a single block of code, and then optimize it... simply becasue they lack the ability to get the overview.

Now, a compiler can do nice things to loops.. unroll them for example... how well they do that in an assembler/c mix.. will, no fsck idea... :P

a good bottleneck to check for is how dependent you are on others code. if your greatbigohnice gfx of a sorting with comparsion doesn't go fast, is it because of the sorting, or is it because that you use QT, need extra 87Mb ram and are dependent on that? bad boy, back to square A :p

in general, you can actually gain from 15-250% on recoding vital parts in assembly. question is, is it worth it?

in general, things that gain from asm is:
low level libraries. anything from lists,sorting,gfx,stringmatching.
very intense loops/encoding/decoding.
would be interesting to see a comparsion of a C - png encode/decode with an asm. derival...

/set rant_mode off

--
Socially Raped

[reply] [top]


[»] Higher-level languages
by Kaufmann - Oct 7th 2000 11:38:23

The observation that the compiler will often generate better assembly than hand-coding also applies to a higher-level point-of-view. Applications programmers who use C or C++ are often forced to reinvent the wheel of common idioms and get caught up in low-level implementation details. This is especially true for applications which require symbolic or declarative functionality. Because of this, even though C/C++'s performance may be better locally than that of languages such as Smalltalk, Python, Haskell, ML, Prolog or even Common Lisp, quite often these languages will provide more satisfactory global performance; and thanks to the ever-improving optimising compilers for VHLLs, I've seen many large applications in Ocaml or Common Lisp that could perform as well as their C/C++ equivalents.

As for systems programming, a new compiler has been developed called PreScheme which accepts a statically type-decidable subset of Scheme, performs type-checking and compiles it to C with very satisfactory results; this combines the expressiveness of nearly-full Scheme with the performing edge of lower-level languages.

So, software in HLLs can be optimal too! :)

[reply] [top]


[»] Don't be so quick to insult assembly
by Dan Papasian - Oct 7th 2000 08:21:56

A proficent assembly programmer can easily write better code than GCC for many of the more speed-intensive portions.

As for recommending inline assembly, it will hurt portability. It is not defined in any ISO C standard, and exists only in gcc as an extension. You will lose portability using it. (Of course, if you're using asm, portability is probably not your first concern)

Usually the speed assembly gives you isn't needed, but in the case of _very_ time critical code, it should not be ruled out.

[reply] [top]




© Copyright 2008 SourceForge, Inc., All Rights Reserved.
About freshmeat.net •  Privacy Statement •  Terms of Use •  Trademark Guidelines •  Advertise •  Contact Us • 
ThinkGeek •  Slashdot  •  ITMJ •  Linux.com •  NewsForge  •  SourceForge.net  •  Surveys •  Jobs •  PriceGrabber