 |
Optimization
by Erik Greenwald, in Editorials - Sat, Oct 7th 2000 23:59 UTC
If you want to write really fast code, what should you do? Hone your
assembly skills? Erik Greenwald thinks that's a mistake too many
people make, and that it's possible to write even better-optimized
code in a high-level language.
Copyright notice: All reader-contributed material on freshmeat.net
is the property and responsibility of its author; for reprint rights, please contact the author
directly.
Introduction
Many people seem to think that the best road to blindingly fast
programs is writing as much as possible in assembly. Typically, this
results in buggy software that is actually slower than that written in
a higher level language. Premature optimizations can cost huge amounts
of time and effort. No optimization should occur before the program
works without crashing or exhibiting weird behavior.
Make it Work.
Attempting to optimize a program that doesn't even run is a pointless
task. If your program crashes, then not only is it impossible to
profile (binaries compiled for profiling write the data file after
your program calls exit), but the users won't want to run it. Make
sure you have no syntactic or memory errors in your program before
even attempting to optimize it. I strongly suggest using all
facilities in the language to extend warnings and errors, as well as
using tools like Splint and Electric Fence
to verify your program will not crash.
Make it Right.
Optimization can often make a program less readable, so you want to guarantee
that the program does exactly what it should BEFORE you start optimizing. Any
logical bugs should be tracked down and squished. I suggest a lot of test
cases, especially at the bounds. "Off by one" errors can be tricky to find in
clear code, and virtually impossible in code that has been optimized beyond
readability.
Make it Fast.
Once we confirm that the program runs free of both syntactic and logical bugs,
we're ready to begin the optimization. Often people will attempt to optimize
prematurely or in the wrong place, resulting in wasted effort. To make
optimizing easy and truly beneficial, a certain order of optimizations should
take place.
Profiling
The first step in the optimization process is to locate and isolate
the performance bottlenecks. Amdahl's Law states that the greatest
opportunity for improvement is where the most time is spent. If a
function is using 50% of the program's processor time and the time
spent in that function is halved, the program will execute 25%
faster. If we alter another function so it runs eight times as fast,
but it only accounts for 10% of the CPU utilization, then we increase
the speed of the program less than 2%. Obviously, the key to a good
optimization is to improve the parts that consume the most time.
There are several tools available to do profiling, but I will
concentrate on GNU
gprof. It's a free tool that works with GCC and produces the
amount of time spent inside of each function. Most libraries can be
compiled with profiling support, so you can even use it to see if a
library function is the bottleneck of your application. The gprof
manual is available at http://www.gnu.org/manual/gprof/,
and the software package is available at the GNU FTP site (and is
probably already installed on your system if you use Linux). To get
started using gprof, you only need to know how to compile your program
to generate profiling information and how to view the generated
information. GCC has a flag to enable profiling information
generation, and the "gprof" command is used to view it. Note that
gprof has to be run in the same directory you ran the program in, and
requires the program as an argument.
$ gcc -o myapp myapp.c -pg
$ ./myapp
$ gprof ./myapp
Algorithmic Optimization
Once we know where the time is being consumed, we can start trimming
down the time consumption. When you find a function or set of
functions that consume a large amount of time, the first question you
should ask yourself is "Can I approach this problem in a better,
smarter way?" Big-O notation is often used to describe the efficiency
of an algorithm, and is useful in this step. When I'm exploring the
difference between different algorithms, I like to use a higher level
language than what I'm writing the finished code in. This allows me to
experiment with some radical changes to the algorithms with very
little time and effort invested to explore those avenues.
Language Usage
While the algorithm itself may be sound, the implementation isn't always. If
you are satisfied with the algorithm, you should look for how you expressed
that algorithm in the programming language to see if there is something you
could have done better. The biggest improvement is usually to cache
computations instead of computing them several times. Functions that are called
frequently may benefit from being inlined. A good knowledge of how the language
is implemented can be very useful here.
Compiler Options
Just about every modern compiler has some fairly hefty optimization
capabilities these days. Finding the right combination of flags to
suit your code may take a little experimentation and reading. The GCC
man page describes what the different flags do in a fair amount of
detail. The first flags to consider are the "-O" flags. "-O" by itself
does light optimization, and is equivalent to "-O1". To disable
optimizations, you can use "-O0", and to set more aggressive
optimization levels, you can use "-O2" or "-O3". Setting the machine
type is also useful, as the compiler can build the assembly to use
extended opcodes and registers. Most of us have x86 assemblers, so
some of the flags useful for that architecture are "-m486",
"-mpentium", and "-mppro", but be careful -- if you set your machine
type too high, earlier machines may not be able to execute the
code. The last series of flags you should consider are the specific
optimization enables. To use the fast math library, for example, you
can add -ffast-math to the compile string, but the O flags often
incorporate several of these and are probably going to generate better
code than if you try to set the "-f" flags yourself. If you know your
program would benefit from a flag and the optimization level doesn't
add that flag for you, you can force it.
Compiler Hints
Modern compilers do a very good job of figuring out how to convert
your code into efficient assembly code. While the #pragma precompiler
directives and certain keywords like "register" are still part of the
language, they should generally be considered deprecated.
Assembly
When it comes to optimizing, the kneejerk reaction of a lot of people
is to write in assembly. This thinking is very wrong. The compiler can
almost always generate better assembly than a mere mortal. If you feel
the need to write a piece in assembly, you should first examine the
assembly that the compiler generates. To make gcc generate an assembly
source file instead of an object or binary executable, simply call it
with the "-S" flag. Examine the generated assembly file, and if you
are absolutely sure that you can build better assembly than gcc, knock
yourself out. I strongly suggest doing the assembly as inline assemble
in the source and wrapping it and the original version in a
precompiler conditional, so the program can still compile on different
architectures.
Conclusion
Given this order of optimizing, you can greatly reduce the time and
effort that goes into making quality software that operates in an
expedient matter. I've only mentioned a few tools to help facilitate
development; there are plenty of others out there (some do slightly
different things, and some target different
compilers/platforms). Happy optimizing.
Erik Greenwald (http://math.smsu.edu/~br0ke/)
is a fourth year computer science student at Southwest Missouri State
University. A Linux addict since 1996, he is active in Free Software
development. He can be reached at erik@smluc.org.
T-Shirts and Fame!
We're eager to find people interested in writing editorials on
software-related topics. We're flexible on length, style, and topic,
so long as you know what you're talking about and back up your
opinions with facts. Anyone who writes an editorial gets a freshmeat
t-shirt from ThinkGeek in
addition to 15 minutes of fame. If you think you'd like to try your
hand at it, let jeff.covey@freshmeat.net
know what you'd like to write about.
[Comments are disabled]
|
Referenced categories
Programming Language :: Assembly
|
Referenced projects
binutils - Provides programs to assemble and manipulate binary and object files.
Electric Fence - A malloc() buffer-overrun debugger that uses the VM hardware.
gcc - The GNU Compiler Collection
Splint - A tool that checks C programs for security problems and coding mistakes.
|
Comments
[»]
assembler is really the higher language!
by frizzz - Mar 16th 2005 22:53:18
I miss in this discussion some sentences about adressing.
"high-level"-programmers seem to think, that the meaning
of a value is built in the value - but the meaning is
reasoned by adressing the value. Every language,
which omits adressing modes is therefore lower than
assembler! It is a subset of the possibilities given by a
CPU. The vastest
lack can be seen in omitting indexed adressing in higher
languages. In the newest releases of gcc, there is the
feature provided to define a labels adress at runtime -
working similar to the assembler mnemonic 'lea' (load
effective adress), but I could not find out how to deal with
this adress in a 'goto'-statement! Opposite to this, it is
very simple using assembler to switch branching in only
one command to thousands of branches! An example
could make it clear:
adress_of_label: DD label ; = a "pointer"
......
jmp [adress_of_label] ; indexed jump to label
As You can redefine the value of "adress_of_label" in
every part of Your program, You can branch without a
case-selection depending on status-values before You
jump. You can find a lot of examples in the assembler OS,
which I just published at http://www.rcfriz.de
This OS is made of only 30K code, but exceeds the well
known M$DOS in every respect - only the first step to a
mature kernel... There a possibilities in using pointers
to adress values in structs - but if You know, how mighty
a mov eax,[es:eax+ebx*4+displacement] is as
part of a loop, You will laugh about those ***,who like to
say "C is higher abstracted..." - again: it's lower
abstracted! But there are other disadvantages, when
those higher level programs are linked. You need not only
more than one (higher) language for that purpose. You
need a crop of bureaucracy! I'm sorry - english is not
my mothertongue. That's why I want to end with this...
--
[reply]
[top]
[»]
Texts/common hangups.
by Spider - Oct 15th 2000 02:52:15
okay, now for some real issue of interest if anyone is still active here:
1) Resources for info, PLEASE!
2) portability:
#ifdef __386__
asm
mov ....
end
#else
C code
#end
That is a really good thing, aye..
Now, for a quesiton/request..
is there anyone who has a tool/toolinprogress to search for repeated
structures of code within a program, or several?
that is, not to search for the -same- structure, but to search for
alike ones...
for x in 0 to 100 do
for y in 0 to 1000 do
for x2 in x-1 to x+1 do
for y2 in y-1 to y+1 do
colour(x,y)+=colour(x2,y2)
end end
colour(x,y)/=3^3
end end
or whatever..
run it on source, find where 1) the programmer 2) the compiler
is repeating code wich can be reused..
and also, code wich can be fast changed into some more optimized roll
of assembler.
there are always certain things that are used more often in programs
as assembler.. ideas/thoughts/andsoon?
-- Socially Raped
[reply]
[top]
[»]
ASSEMBLER
by - GALAXY - - Oct 9th 2000 11:22:18
i have written many assembler programs for many machines, like c64, snes,
amiga, palm, windows
ans so on. and i must say, the best structured and best optimized
programes can only been done
with assembler. its sadly true that the portability of programs for
different plattforms is an disease
or nearly impossible,m also learning the new registers, code and
whatever is hard to. but its still
true and even will ever be, that an ARTISTIC assembler programmer that
knows how to save
tact cycles will ever create a faster code than any c fuckcompiler
will do..
..to be honest programming these days, or whatever ppl call these java
things went curious these
days.. so who cares..
[reply]
[top]
[»]
Optimization
by Zygo Blaxell - Oct 7th 2000 23:26:47
A few comments on this article, before I throw it away completely and
provide a replacement. ;-)
The Intel Optimization Handbook (manual? appendix? docu-thingy?
After you read the entire text of the thing, you'll forget its title
too ;-) is perfect reading for those long black winter nights. Just make
sure you haven't eaten first.
Optimization on real CPU architectures is a relatively simple task:
you simply slot data into registers until you run out of program, and you
arrange to use the parallel execution units in a relatively straightforward
way. The bulk of the optimization time is spent analyzing the code's
data flow structure and selecting from so many orthogonal optimization
possibilities. Optimizing on RISC is a relatively straightforward
operation, but it requires doing a lot of tedious counting that humans are
not so good at. Often on RISC architectures the opcodes have concurrency
issues with other opcodes, which humans are also not so good at.
Contrast with the IA32 architecture, where most of the task of
"optimization" is recoding the program in such a way that
it's semantically equivalent, but avoids all the hardware bottlenecks.
After you've finished performing relatively expensive optimizations
on IA32, you get performance roughly equivalent to what you would get from
a RISC architecture before you tried doing optimization at all--and worse,
you've already allocated all of the chip's actual resources, so
you can't optimize any further. The bulk of the optimization time is
spent searching a cookbook of pattern-matching algorithms to find idioms
that are slow on IA32 hardware and transparently replacing them with idioms
that are fast(er). Humans are good at IA32 optimization simply because
they have more imagination than most compilers.
Modern IA32 has a lot of design trade-offs. 8 and 32-bit register
operations are twice as fast as 16-bit ones, because Pentiums can execute
two 8/32-bit opcodes per clock cycle but only one 16-bit opcode. The
parallelization rhythm (the exact arrangement of opcodes to fill up the
instruction prefetching pipeline so that you can execute instructions in
parallel) within a Pentium I/II/III is different for each CPU generation.
REAL CPU's have floating-point registers (not a stack), extension
instruction sets that are not mutually exclusive (nor cost 100+ cycles to
switch from one to another), and branch prediction that actually works
right the first time. IA32 has weird opcodes from the 8086 era that do
useful things in exceptional cases.
Another thing to consider, one which I haven't seen mentioned
here yet, is that modern CPU's have a huge gap between the nominal
clock rate of the CPU and the performance of the actual system. Sure, it
says 1GHz on the box, but actually only 0.1% of the memory of the machine
really runs at that clock rate--much less, if you count disk space as
memory (and with applications that swap or use databases, disk space does
count as memory). The other 99% of the RAM runs almost 10 times
slower--even slower than that if there's evil RAM tricks like
fast-page-mode going on. This is why some programs crawl at a snail's
pace on Celerons while they fly on Pentiums or AMD chips: "it's
the cache, stupid..."
The sad fact is, no amount of assembly code hand-optimization can
compensate for even a handful of L2 cache misses. Your program will lose
dozens of CPU cycles for every cache miss, and you won't get them back
without redesign at or above the C language level. Conversely, you can get
them back by redesigning your data flow, and you don't need assembly
language to do that. Also, optimizations of this kind really are portable,
at least to other CPU's where the data is the same size and shape, as
it were.
My own approach to optimization is the same as my approach to
debugging and security: don't put bugs in the program to begin with,
don't introduce security holes into the program to begin with, and
don't make the program slow to begin with.
Optimization, like debugging and security, begins at the design stage,
and carries through until release (and even beyond, if the program is
actively maintained after release).
0. Be aware of what the goals of your program really are. If your
program has to run in some amount of time between one and ten hours and
produce results read by millions of people, it probably doesn't need a
lot of optimization, but it might need to be debugged and secure. If it
has to run once every 33 milliseconds, but has limited communication with
the outside world, it might need more optimization than security.
1. Be aware of the cost of correcting detected mistakes versus
undetected ones. Don't write code to specifications that are provably
wrong on paper. Don't integrate code into the program without testing
it before and after integration.
2. Be aware of the cost of correcting security flaws before and after
they become enshrined in working code. Have a security policy in mind when
you input data, and carry it through until you're finished with it.
Do this for all programs, even if the policy you decide on is "you
will run this on a non-networked machine using data from trusted users
only."
3. Be aware of the architecture of the CPU, the architecture of the
compiler, and the relative costs of your data structures and algorithms.
Not all memory runs at the same speed, so size can be as important as speed
within a critical section of code. Know the memory structure (L1/L2 cache
size) and relative speeds of the target architecture, and keep the
speed-critical loops within those sizes. Don't pick algorithms that
have O(N^x) running time when algorithms with O(N^(x-1)) running time are
available, assuming the cost of implementing both algorithms is equal.
Don't optimize a loop to save seven clock cycles per iteration if
there's an L2 cache miss at the end--the latency cost for accessing
main memory is so high that you'll never notice the difference.
Similarly, don't design a data structure that causes unnecessary L2
cache misses--or even L1 misses, if you can avoid them. If you have
multiple CPU's with threads, don't use a design that forces all
of them to wait on a single lock. And so on.
On the other hand, if you know you're going to take an L2 cache
hit, and you can't do anything about it, you can afford to spend a
cycle or two doing something else, like run-time correctness insertions, or
interpreting the byte-codes of your program.
[reply]
[top]
[»]
register keyword
by Jay F Kominek - Oct 7th 2000 22:03:58
I'd like to point out that very few, if any compilers use perfect register
allocation algorithms. Your compiler does not understand your algorithm,
and cannot. I have personally seen cases where specifying which variables
to keep in registers improved speed drastically.
(There are a number of other flaws with this post which I'm not going to
take the time to itemize. Go read your compiler's source code and see if it
looks like it knows more about your CPU than you. If so, don't waste your
time writing assembly. If you do know more than it, then you might as well
spend as much time as you want on it, because it doesn't matter to the rest
of the world if you waste your time.)
-- - Jay Kominek
[reply]
[top]
[»]
optimization article and comments so far...
by ajs - Oct 7th 2000 18:47:59
I find it very interesting that the freshmeat community would think much
about optimization. In this community, I'd think portability would be much
more important.
I worry about optimization all the time -- I write DSP and embedded
code for a living, most recently video codecs -- so let me share a few
thoughts. First, always use basic OO concepts, even (and maybe especially)
if you're writing in C. Design an appropriate data structure and functions
to operate on it. Avoid globals! Your code will stay small and tight and
reentrant. And if you're writing in C++, avoid a lot of the non-portable
aspects of the language, even if you think it should make your code faster.
See the mozilla c++ portablity guide for details.
For example, a lot of times, image data needs to be accessed by row or
by column, by region-of-interest, by even/odd field, or by 8x8 block. In
these cases, I create a data structure that takes a pointer to the base of
the image data, and length and stride values for both dimensions. This
basic data structure and its methosds can point to the whole image, a field
or a block, so you don't drown in memcpy() when you work with pieces of the
image.
Also, write good Makefiles so you can build hyper-optimized, optimized
with profiling, or debug, and with different compilers, without rewriting
your makefile each time. If you're building for Linux, or using gcc and
you're just doing -O2, try this instead:
-O3 -funroll-loops -fomit-frame-pointer -fno-strength-reduce
-ffast-math
I don't know if the Pentium-Optimized GCC (pgcc ) effort is still
ongoing, but these guys were trying to make gcc much faster for x86.
Hopefully their efforts have been folded in with the main gcc development.
If you're not working in Linux, try the machine's native compiler if
it's available. I had a huge speedup on an application running on an SGI
Octane, just by going to the SGI compiler instead of gcc. (I love gcc., no
flames please!) I
If you're in Win32-hell and can spare a few bucks, buy the VTune
package. This includes Intel's compiler which smokes Microsoft's (big
surprise...) and also produces 10-20% faster code than Cygwin gcc in my
experience. VTune is a profiler that can give very detailed reports (down
to individual lines of code, which can be viewed c/asm interleaved.)
When you start an optimization effort, go for the easy stuff first.
The easiest thing is to inline functions. If one function is being called
10-100x more than all the others, inline it. Also look there for variables
touched in the inner loop. Sometimes declaring one of these as register, or
removing a repeated intermediate-result calculation, can have a huge
effect. Even simple division or intermediate-result calculation in floating
point can be more costly than you think, if it's inside a nested loop!
Finally, if you find you need to go to assembly, always #ifdef that
portable C version you wrote, so the poor sot who inherits your code and
has to port to another architecture has something to start from! And it
usually helps to start your optimization effort with the compiler's
assembly output. Some compilers are better than others for this.
-- Alan
[reply]
[top]
[»]
Here we go again
by Anton Aylward - Oct 7th 2000 18:43:05
I remember arguing this over 20 years ago when compilers were less
sophisiticated and assembly languages more regular. That same old chestnut
about justifying assembly for the inner loop was used as a last ditch
defense then as well.
The best argument I've ever heard against this is that processors
are getting faster. Sadly the argument against that is programmers are
writing code to soak up that extra power - sort of a Parkinson's Law.
Look at it this way. In 1982 I was administering a PDP-11/45 with 4Meg or
semicondutor RAM and a twin 300Meg disk drives that was being used as a
development platform for 40 people. My 350MHz IBM Pentium desktop
workstation at 'work' with 96Meg and 8Gig running W/95 crawls by
comparison.
There are a few lessons here. The first is 'don't write
unnnessary code'.
But the lesson I'd like you to remember is the same one we
preached back in the 70s, and it still applies here. Yes, you may be able
to write better assembly code than the compiler can produce, but can you do
it consitently, day in and day out, in a clear, well documented, logical,
easy to maintain form that will be understood by others years from now?
Personally, I've come to see compilers make great strides. What
you should be doing now is looking at the compiler as an expert system for
assembly code production. It has the distilled wisdom of many very smart
people. This approach means that even mediocre programmers can be
effective, and we shure as hell need more programmers! It means that good
and excellent programmers can be much more productive. The compiler is in
effect an amplifer. It amplifies the amount of code, it peforms all manner
of checks (if you let it, if you pick a suitable language). All this is
well documented. The last time I fought this battle I got fired; by even
the most conservative metric, using the Software Tools package and GCC I
was at elast 80 times as effective, that is I produced code that was fully
tested, packaged, documented.... as teh assembly programers on the project.
And when the tried shaking me off by changing the data structures, I was
back on track in under 20 minutes.
The trouble was they weren't paying me 80 times as much as the
assembly programmers. Which is one reason I don't work as a
programmer any more. There are easier ways to make lots of money than
coding all the hours God sends.
If you want to re-fight this old war, then fine, but at least
don't pretend it hasn't been fought before and that the assembly
side did not come out ahead. Research history and the territory before you
try fighting this over again.
Business wants to get product out the door. HLLs are amplifiers and
expert systems for code production. Sorry guys, you can beleive whatever
you want to believe, but there is a basic economic bottom line.
[reply]
[top]
[»]
University Teachings
by Adam Bellaire - Oct 7th 2000 18:16:44
This is a dangerous pitfall that is still being more or less 'taught' in
universities. I know because they tried to teach it to me. The fact is,
when I started studying here, the first time Assembly came up in my
Computer Organization course it was heavily emphasized that assembly was
used for its efficiency, but this course in assembly was taught BEFORE the
course in Software Engineering which taught about the emphasis that must be
given to correctness first. Different professors give different signals.
The intro professor made sure to make a point that C compilers are vastly
more efficient than most programmers, which makes sense since it was an
intro course and most of the people in it probably didn't have the
knowledge of asm to use it more effectively than the compiler. But there
really wasn't much emphases on this 'correctness first' issue, and the
workability and the 'software crisis' didn't come up until my junior
year!
I'm just glad to see an article about this, because I feel it gave a very
good overview in just a couple of pages of a process that's extremely
important and that I feel my education, despite its expense, failed to
address adequately.
[reply]
[top]
[»]
Optimization
by Matthias M Giwer - Oct 7th 2000 17:02:21
YMM(Will)V
I am at a loss as to why such an article still needs be written.
Make it work first; be clever later. There is very little market for
the fast routines that don't work. Schedules call for delivery of working
products.
Besides, it might be good enough the first time around and you get the
reputation of a miracle worker. Fast enough may be good enough. The "need
for speed" is for bragging rights not necessarily for the real world.
Better really is the enemy of good enough.
Despite the cynicism of some in the business, it is much easier to sell
improvements than explain failures. And in this high stress business having
a working product ahead of schedule reduces the Tagamet budget.
Of course it is always good to have a management name for it. "Proof of
concept," sounds good. I'm sure Dogbert can come up with a better one.
The point is well taken regarding today's compilers, although it does
depend upon the skill of those who write compilers, optimum code for an
8086 is not expected to be optimum for a PII - MMX comes to mind. The
suggestion to look at the compiler's machine code first, even if only as a
cheat sheet, should be mandatory. Compilers can't be optimal for every
procedure in every context but that is likely the last place for the
problem to be.
It is not a matter of defending or deriding ASM but rather leave it as
a last resort only if necessary.
Anyone who has gone the route of making it work first can tell some
great stories. My best example(from 14 years ago) is 18 minutes for the
first brute force version that worked correctly down to 45 seconds with
only two improvements. But then back up to 2 minutes when I "improved" it
again. If I had put all three improvements in the first version ... you can
see where that leads. And those improvements were in the TP not assembly.
It was all in the parts unique to the program not in things like a faster
sort.
Often it is best to use known tools wherever possible. The above
program took maybe fifty hours in TPascal and needed programs to write
program code and bloated up to about 8000 lines. Just a few months ago I
duplicated the unimproved version in a couple hours with an 18 line shell
script calling standard 'nix tools.
Considering the speed difference (8 v 333MHz) my test file back then
was Patrick Henry's Liberty or Death speech which gave those times. This
version does the entire novel 1984 in 40 some seconds with nothing but the
brute force approach with no improvements.
Speaking of hardware, consider the total cost of a programmer per hour
(roughly double the pay) and the hours it will take to write and test any
code improvements including ASM. Then see how that compares to the cost of
a faster machine in terms of the desired speed improvment. That is
significantly less than twenty hours labor for a 666 machine. (No one tell
Jerry Falwell the real speed.)
Again a few months ago, I knew how to render a triangle mesh in POVray
even though the resulting mesh was an 11Meg file. It worked (and managed to
crash KDE graphics terminals so it had to be rendered from a real console.)
Then I dug into the TARGA format and learned how to create one which
renders much faster and is under 200k file size. If I had gone direct to
.tga format would the bugs be in the program that generated the file or in
my understanding of the file format? or both? And I stuck with what I know,
TPascal (now released for linux and others as Free Pascal), rather than
trying one of the numerous linux languages I now have for free.
I was convinced back then KISS was best and nothing in the intervening
years has even come close to changing my opinion. It has never failed me.
As I think Heinlein said, all progress is due to the lazy man.
No program is ever bug free until all of the people whose lives
the last user affected have joined him in eternity.
[reply]
[top]
[»]
possible misinterpretations
by Erik Greenwald - Oct 7th 2000 16:52:26
I never said to never use assembly. I've done assembly on a few
different platforms (even the crude mnemonic coding done on commadore
64's via a monitor). I see a lot of people who feel re-writing in
assembly is the best first approach. I've seen projects where people
brag it will be fast because they're doing it almost entirely in
assembly. That's the kneejerk reaction I was referring to :)
Inline assembly: If you write any assembly at all, you are killing
portability. Period. Inline assembly gives you the chance to keep it all in
one place and do some preprocessor magic so a portable (possibly)
alternative is available. #ifdef __X86_GAS ...
High level languages: Most people develop in C/C++, so I used that as
the reference high level lang. I have some opengl code in scheme I wrote to
experiment with some algorithms, it is very sufficient. But most people
don't write in scheme, so *shrug* :)
'Big picture': If you either explicitely inline the
functions, or tell the compiler to inline what it can, then it will see
'the big picture'. Vital parts may benefit from recoding in
assembly, but first you need to see where those vital parts are, second you
need to try to call those vital parts as little as possible.
Kneejerk reaction: Portability is a key issue. Some of my friends who
use non-linux unices like to rant about how non-portable linux code is. Yet
more of my friends use non-x86 hardware with linux, and are at the mercy of
programmers to do the right thing in respect to endian and assembly issues.
Plenty of asm code is in asm because that's the right thing to do.
Thinking asm is an instant speed cure is wrong. The assembly section
mentions seeing exactly what gcc is doing, then seeing how to do it better.
Rewriting something in assembly that doesn't offer a real speedup is
pointless and wasteful. :) Amdahl's law and Occam's Razor.
Data structures: I should have made a mention of finding better data
structures in the section about algorithmic optimization. Thanks for
pointing that out :)
If anyone feels I said that assembly at all is wrong, I apologize. It
wasn't my intent to insult the language nor the people who use the
languge. Assembly has its place, and is required for certain things. Part
of my motivation in writing this was to counter the people who believe that
assembly is the first avenue to good optimization. There are a lot of steps
to take before using assembly that can result in much better improvements
to speed and/or size.
-Erik
[reply]
[top]
[»]
Optimization
by ehlarson - Oct 7th 2000 15:03:31
Interesting article, but I think it misses the most important form of
optimization of all - data representation. Often the way that you represent
the problem in data structures determines the algorithms that are available
to you, and thus how fast the problem can be solved. There are a number of
examples of this described in books like 'Programming Pearls' and 'The
Mythical Man Month'. My favorite is perhaps a bit esoteric, but a good
example nonetheless - use of the method of orthogonal collocation to
collapse a partial differntial equation to a set of ordinary differential
equations. This nifty mathematical trick alows you to reduce the
dimensionality of a problem by one, or reduce the big O by one - i.e. speed
up the solution by a factor of 10.
Finding a clever data representation is often the lightning bolt that
shatters the oak tree.
[reply]
[top]
[»]
One of my favorite tutorial about optimisation
by Eg0r - Oct 7th 2000 14:19:23
From Imaging Technology. Get the first
part here
and the second part here.
Okay, the journal concentrates on image processing, but it's no reason to
skip these excellent tutorial.
While the first part goes quite in depth on C optimisation, with tricks
like using pointers instead of increment variables, loop unrolling and
optimisation in assembly (the bad way at first), the second part explains
how MMX can be used to accelerate an assembly written version of the
algorithm.
Well written and definitely worth a read... Any other good tutorials on
the internet?
[reply]
[top]
[»]
C contra ASM
by Spider - Oct 7th 2000 12:47:28
Now this is really something to think through first, yes.
C -> asm is -not- probably better than handcoded asm.
to give a quick 'n' dirty example
function putpixl
{
blagh blagh made in C
}
function getpixl {
blagh blagh made in c
}
main big loop {
for x is 0 to 640
for y is 0 to 480
putpixl ( getpixl + getpixl + getpixl + getpixl /4)
;
;
once that piece of code is working, optimize! asm is much easier to
optimize with some experience in the matter. and very few compilers manage
to recode several function calls into a single block of code, and then
optimize it... simply becasue they lack the ability to get the overview.
Now, a compiler can do nice things to loops.. unroll them for
example... how well they do that in an assembler/c mix.. will, no fsck
idea... :P
a good bottleneck to check for is how dependent you are on others
code. if your greatbigohnice gfx of a sorting with comparsion doesn't
go fast, is it because of the sorting, or is it because that you use QT,
need extra 87Mb ram and are dependent on that? bad boy, back to square A
:p
in general, you can actually gain from 15-250% on recoding vital parts
in assembly. question is, is it worth it?
in general, things that gain from asm is:
low level libraries. anything from lists,sorting,gfx,stringmatching.
very intense loops/encoding/decoding.
would be interesting to see a comparsion of a C - png encode/decode
with an asm. derival...
/set rant_mode off
-- Socially Raped
[reply]
[top]
[»]
Higher-level languages
by Kaufmann - Oct 7th 2000 11:38:23
The observation that the compiler will often generate better assembly than
hand-coding also applies to a higher-level point-of-view. Applications
programmers who use C or C++ are often forced to reinvent the wheel of
common idioms and get caught up in low-level implementation details. This
is especially true for applications which require symbolic or declarative
functionality. Because of this, even though C/C++'s performance may be
better locally than that of languages such as Smalltalk, Python,
Haskell, ML, Prolog or even Common Lisp, quite often these languages will
provide more satisfactory global performance; and thanks to the
ever-improving optimising compilers for VHLLs, I've seen many large
applications in Ocaml or Common Lisp that could perform as well as their
C/C++ equivalents.
As for systems programming, a new compiler has been developed called
PreScheme which accepts a statically type-decidable subset of Scheme,
performs type-checking and compiles it to C with very satisfactory
results; this combines the expressiveness of nearly-full Scheme with the
performing edge of lower-level languages.
So, software in HLLs can be optimal too! :)
[reply]
[top]
[»]
Don't be so quick to insult assembly
by Dan Papasian - Oct 7th 2000 08:21:56
A proficent assembly programmer can easily write better
code than GCC for many of the more speed-intensive portions.
As for recommending inline assembly, it will hurt portability. It is not
defined in any ISO C standard, and exists only in gcc as an extension. You
will lose portability using it. (Of course, if you're using asm,
portability is probably not your first concern)
Usually the speed assembly gives you isn't needed, but in the case of
_very_ time critical code, it should not be ruled out.
[reply]
[top]
|
 |