This is mostly extracted from http://reviews.llvm.org/D18960.
The general idea for TLSDESC is that the two GD GOT entries are used
for a function pointer and its argument. The dynamic linker sets
both. In the non-dlopen case the dynamic linker sets the function to
the identity and the argument to the offset in the TLS block.
All that the static linker has to do in the non-dlopen case is
relocate the code to point to the GOT entries and create a dynamic
relocation.
The dlopen case is more complicated, but can be implemented in another patch.
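A minimal sketch of the idea, with hypothetical names (this is not LLD's
actual representation):

  #include <cstddef>
  #include <cstdint>

  // The two GD GOT entries form a descriptor: a resolver function and its
  // argument. The dynamic linker fills in both.
  struct TlsDescriptor {
    std::ptrdiff_t (*Resolver)(TlsDescriptor *); // first GOT entry
    std::size_t Arg;                             // second GOT entry
  };

  // Non-dlopen case: the resolver is the identity function and Arg is the
  // offset of the variable in the TLS block.
  static std::ptrdiff_t identityResolver(TlsDescriptor *Desc) {
    return static_cast<std::ptrdiff_t>(Desc->Arg);
  }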
llvm-svn: 271569
This patch implements the next relaxation from the latest ABI:
"Convert memory operand of test and binop into immediate operand, where binop is one of adc, add, and, cmp, or,
sbb, sub, xor instructions, when position-independent code is disabled."
It is described in the System V Application Binary Interface AMD64 Architecture Processor
Supplement Draft Version 0.99.8 (https://github.com/hjl-tools/x86-psABI/wiki/x86-64-psABI-r249.pdf,
section B.2 "Optimize GOTPCRELX Relocations").
Differential revision: http://reviews.llvm.org/D20793
llvm-svn: 271405
MergedInputSection::getOffset is the busiest function in LLD if string
merging is enabled and input files have lots of mergeable sections.
It is usually the case when creating an executable with debug info,
so it is pretty common.
The reason why it is slow is that it has to do fairly complex
computations. For non-mergeable sections, section contents are
contiguous in output, so in order to compute an output offset,
we only have to add the output section's base address to an input
offset. But for mergeable strings, section contents are split for
merging, so they are not contiguous. We've got to do some lookups.
We used to do binary search on the list of section pieces.
I think it is slow because binary search is hostile to branch prediction.
This patch replaces it with a hash table lookup, which seems to work
pretty well. Below is "perf stat -r10" output when linking clang
with debug info. In this case the patch speeds up the link by about 4%.
Before:
6584.153205 task-clock (msec) # 1.001 CPUs utilized ( +- 0.09% )
238 context-switches # 0.036 K/sec ( +- 6.59% )
0 cpu-migrations # 0.000 K/sec ( +- 50.92% )
1,067,675 page-faults # 0.162 M/sec ( +- 0.15% )
18,369,931,470 cycles # 2.790 GHz ( +- 0.09% )
9,640,680,143 stalled-cycles-frontend # 52.48% frontend cycles idle ( +- 0.18% )
<not supported> stalled-cycles-backend
21,206,747,787 instructions # 1.15 insns per cycle
# 0.45 stalled cycles per insn ( +- 0.04% )
3,817,398,032 branches # 579.786 M/sec ( +- 0.04% )
132,787,249 branch-misses # 3.48% of all branches ( +- 0.02% )
6.579106511 seconds time elapsed ( +- 0.09% )
After:
6312.317533 task-clock (msec) # 1.001 CPUs utilized ( +- 0.19% )
221 context-switches # 0.035 K/sec ( +- 4.11% )
1 cpu-migrations # 0.000 K/sec ( +- 45.21% )
1,280,775 page-faults # 0.203 M/sec ( +- 0.37% )
17,611,539,150 cycles # 2.790 GHz ( +- 0.19% )
10,285,148,569 stalled-cycles-frontend # 58.40% frontend cycles idle ( +- 0.30% )
<not supported> stalled-cycles-backend
18,794,779,900 instructions # 1.07 insns per cycle
# 0.55 stalled cycles per insn ( +- 0.03% )
3,287,450,865 branches # 520.799 M/sec ( +- 0.03% )
72,259,605 branch-misses # 2.20% of all branches ( +- 0.01% )
6.307411828 seconds time elapsed ( +- 0.19% )
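For illustration, a minimal sketch of the lookup idea, with hypothetical
names (the real code also has to handle relocations that point into the
middle of a piece):

  #include <cstdint>
  #include <unordered_map>
  #include <vector>

  struct Piece {
    uint64_t InputOff;  // where the piece starts in the input section
    uint64_t OutputOff; // where the piece landed in the output section
  };

  struct MergedSection {
    std::vector<Piece> Pieces;
    // Maps a piece's input offset to its output offset so getOffset becomes
    // a hash lookup instead of a binary search over Pieces.
    std::unordered_map<uint64_t, uint64_t> OffsetMap;

    void buildOffsetMap() {
      for (const Piece &P : Pieces)
        OffsetMap[P.InputOff] = P.OutputOff;
    }

    uint64_t getOffset(uint64_t InputOff) const {
      return OffsetMap.at(InputOff);
    }
  };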
Differential Revision: http://reviews.llvm.org/D20645
llvm-svn: 270999
MIPS .reginfo and .MIPS.options sections are consumed by the linker, and
the linker produces a single output section. But it is possible that
input files contain section symbols that point to the corresponding input
sections. When generating relocatable output we need to write
such symbols to the output file.
Fixes bug 27878.
Differential Revision: http://reviews.llvm.org/D20688
llvm-svn: 270910
System V Application Binary Interface AMD64 Architecture Processor Supplement Draft Version 0.99.8
(https://github.com/hjl-tools/x86-psABI/wiki/x86-64-psABI-r249.pdf, section B.2 "Optimize GOTPCRELX Relocations")
introduces possible relaxations for R_X86_64_GOTPCRELX and R_X86_64_REX_GOTPCRELX.
This patch implements the following relaxation:
mov foo@GOTPCREL(%rip), %reg => lea foo(%rip), %reg
and also opens the door to implementing all the other ones.
The implementation was suggested by Rafael Ávila de Espíndola, with a few additions and testcases by me.
Differential revision: http://reviews.llvm.org/D15779
llvm-svn: 270705
Previously, a mergeable section's constructor did more than just
set member variables; it split the section contents into small
pieces. That is not always a computationally cheap task because, if
the section is a mergeable string section, it needs to scan the
entire section to split it by NUL characters.
If a section would be thrown away by GC, that cost ended up
being a waste of time. It is an even larger problem if the
section is compressed -- all the time spent uncompressing and
splitting it is wasted.
Luckily, we can defer section splitting until after GC. We just have
to remember which offsets are in use during GC and apply that later.
This patch implements it.
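A minimal sketch of the splitting step that is now deferred (hypothetical
names; the real code also records which pieces GC found to be in use):

  #include <cstddef>
  #include <cstdint>
  #include <string_view>
  #include <vector>

  struct Piece {
    uint64_t InputOff;
    bool Live = false; // filled in from the offsets remembered during GC
  };

  // Split a SHF_STRINGS section into NUL-terminated pieces. Running this
  // only for live sections avoids scanning (or decompressing) dead ones.
  static std::vector<Piece> splitStrings(std::string_view Data) {
    std::vector<Piece> Pieces;
    uint64_t Off = 0;
    while (!Data.empty()) {
      std::size_t End = Data.find('\0');
      std::size_t Size =
          End == std::string_view::npos ? Data.size() : End + 1;
      Pieces.push_back({Off, false});
      Data.remove_prefix(Size);
      Off += Size;
    }
    return Pieces;
  }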
Differential Revision: http://reviews.llvm.org/D20516
llvm-svn: 270455
This patch adds a Size member to SectionPiece so that getRangeAndSize
can just return a SectionPiece instead of a std::pair<SectionPiece *, uint_t>.
Also renamed the function.
llvm-svn: 270346
We were using std::pair to represent pieces of splittable section
contents. It hurt readability because "first" and "second" are not
meaningful. This patch gives them names.
One more thing: piecewise liveness information was stored in
the second element of the pair as a special value of the output
section offset. That was confusing, so I defined a new bit, "Live",
in the new struct.
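A minimal sketch of the shape of the new struct (illustrative field types;
not necessarily LLD's exact definition):

  #include <cstdint>

  struct SectionPiece {
    uint64_t InputOff;  // offset of the piece in the input section
    uint64_t OutputOff; // offset of the piece in the output section
    bool Live;          // explicit liveness bit instead of a magic offset
  };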
llvm-svn: 270340
This makes it explicit that each R_RELAX_TLS_* is equivalent to some
other expression.
With this I think we are at a sweet spot for how much is done in
Target.cpp. I did experiment with moving *all* the value math out of it.
It has the advantage that we know the final value in target independent
code, but it gets quite verbose.
llvm-svn: 270277
This adds direct support for computing offsets from the thread pointer
for both variants. Of the architectures we support, variant 1 is used
only by aarch64 (but that doesn't seem to be documented anywhere).
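A rough sketch of the two layouts (constants and alignment handling are
illustrative; the real rules are target-specific):

  #include <cstdint>

  static uint64_t alignTo(uint64_t V, uint64_t Align) {
    return (V + Align - 1) & ~(Align - 1);
  }

  // Variant 1 (e.g. aarch64): the TLS block lives above the thread pointer,
  // after a fixed-size TCB, so offsets from TP are positive.
  static uint64_t tpOffsetVariant1(uint64_t SymOff, uint64_t TcbSize) {
    return TcbSize + SymOff; // e.g. TcbSize == 16 on aarch64
  }

  // Variant 2 (e.g. x86_64): the TLS block ends at the thread pointer, so
  // offsets from TP are negative.
  static int64_t tpOffsetVariant2(uint64_t SymOff, uint64_t TlsSize,
                                  uint64_t Align) {
    return int64_t(SymOff) - int64_t(alignTo(TlsSize, Align));
  }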
llvm-svn: 270243
The new names reflect the purpose of the corresponding GOT entries better.
Both expression types relate to entries allocated in the 'local'
part of the MIPS GOT. R_MIPS_GOT_LOCAL_PAGE is for entries that contain
'page' addresses. R_MIPS_GOT_LOCAL is for entries that contain 'full' addresses.
llvm-svn: 269597
We were previously using an output offset of -1 for both GC'd and tail
merged pieces. We need to distinguish these two cases in order to filter
GC'd symbols from the symbol table -- we were previously asserting when we
asked for the VA of a symbol pointing into a dead piece, which would end
up asking the tail merging string table for an offset even though we hadn't
initialized it properly.
This patch fixes the bug by using an offset of -1 to exclusively mean GC'd
pieces, using 0 for tail merges, and distinguishing the tail merge case from
an offset of 0 by asking the output section whether it is tail merge.
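A minimal sketch of the resulting logic, with hypothetical names:

  #include <cassert>
  #include <cstdint>
  #include <unordered_map>

  struct Piece {
    uint64_t OutputOff = 0; // -1 means GC'd; 0 may mean "tail merged"
  };

  struct MergeOutputSection {
    bool TailMerge = false;
    std::unordered_map<uint64_t, uint64_t> Offsets; // InputOff -> final off
  };

  static bool isGced(const Piece &P) { return P.OutputOff == uint64_t(-1); }

  static uint64_t getPieceOffset(const Piece &P, const MergeOutputSection &S,
                                 uint64_t InputOff) {
    assert(!isGced(P) && "asked for the offset of a GC'd piece");
    if (S.TailMerge)                 // 0 is ambiguous, so ask the section
      return S.Offsets.at(InputOff); // resolved by the tail-merge builder
    return P.OutputOff;
  }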
Differential Revision: http://reviews.llvm.org/D19953
llvm-svn: 268604
The MIPS N64 ABI introduces the .MIPS.options section, which specifies
miscellaneous options to be applied to an object/shared/executable file.
LLVM, as well as modern versions of the GNU tools, reads and writes only
one type of option - ODK_REGINFO. It is an exact copy of the .reginfo
section used by the O32 ABI.
llvm-svn: 268485
Relocations against sections with no SHF_ALLOC bit are R_ABS relocations.
Currently we are creating a Relocations vector for them, but that is wasteful.
This patch skips the vector construction and directly applies relocations
in place.
This patch seems to be pretty effective for large executables with debug info.
r266158 (Rafael's patch to change the way we apply relocations) caused a
temporary performance degradation for such executables, but this patch makes
it even faster than before.
Time to link clang with debug info (output size is 1070 MB):
before r266158: 15.312 seconds (0%)
r266158: 17.301 seconds (+13.0%)
Head: 16.484 seconds (+7.7%)
w/patch: 13.166 seconds (-14.0%)
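A minimal sketch of the in-place fast path (hypothetical names; the real
code dispatches on the relocation type and width):

  #include <cstdint>
  #include <vector>

  struct Reloc {
    uint64_t Offset; // where in the section to patch
    uint64_t Value;  // already-resolved absolute value (R_ABS)
  };

  // Patch the section bytes in a single pass instead of first recording the
  // relocations in a vector and revisiting them later.
  static void applyInPlace(std::vector<uint8_t> &Data,
                           const std::vector<Reloc> &Rels) {
    for (const Reloc &R : Rels)
      for (int I = 0; I < 8; ++I) // illustrative 64-bit little-endian write
        Data[R.Offset + I] = uint8_t(R.Value >> (8 * I));
  }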
Differential Revision: http://reviews.llvm.org/D19645
llvm-svn: 267917
The fix is to handle local symbols referring to SHF_MERGE sections.
Original message:
GC entries of SHF_MERGE sections.
It is a fairly direct extension of the gc algorithm. For merge sections
instead of remembering just a live bit, we remember which offsets
were used.
This reduces the .rodata sections in chromium from 9648861 to 9477472
bytes.
llvm-svn: 267233
It is a fairly direct extension of the gc algorithm. For merge sections
instead of remembering just a live bit, we remember which offsets were
used.
This reduces the .rodata sections in chromium from 9648861 to 9477472
bytes.
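A minimal sketch of the piece-marking step, with hypothetical names:

  #include <algorithm>
  #include <cstdint>
  #include <iterator>
  #include <vector>

  struct Piece {
    uint64_t InputOff;
    bool Live = false;
  };

  // For a SHF_MERGE section, GC records the referenced offsets rather than
  // a single live bit; afterwards only the pieces containing those offsets
  // are marked live. Pieces are sorted by InputOff.
  static void markLivePieces(std::vector<Piece> &Pieces,
                             const std::vector<uint64_t> &LiveOffsets) {
    for (uint64_t Off : LiveOffsets) {
      // The containing piece is the last one whose start is <= Off.
      auto It = std::upper_bound(
          Pieces.begin(), Pieces.end(), Off,
          [](uint64_t O, const Piece &P) { return O < P.InputOff; });
      if (It != Pieces.begin())
        std::prev(It)->Live = true;
    }
  }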
llvm-svn: 267164
It turns out that this will read data from the section to properly
handle Elf_Rel implicit addends.
Sorry for the noise.
Original messages:
Try to fix Windows lld build.
Move getRelocTarget to ObjectFile.
It doesn't use anything from the InputSection.
llvm-svn: 267163
This requires adding a few more expression types, but is already a small
simplification. Having Writer.cpp know the exact expression will also
allow further simplifications.
llvm-svn: 266604
With this patch we use the first scan over the relocations to remember
the information we found about them: whether they will be relaxed, whether
a PLT will be used, etc.
With that the actual relocation application becomes much simpler. That
is particularly true for the interfaces in Target.h.
This unfortunately means that we now do two passes over relocations for
non-SHF_ALLOC sections. I think this can be solved by factoring out the
code that scans a single relocation. It can then be used both as a scan
that records info and for a dedicated direct relocation of non-SHF_ALLOC
sections.
I also think it is possible to reduce the number of enum values by
representing a target with just an OutputSection and an offset (which
can be from the start or end).
This should unblock adding features like relocation optimizations.
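A minimal sketch of what the scan records per relocation (names and fields
are illustrative, not LLD's actual ones):

  #include <cstdint>

  enum class RelExpr { Abs, Got, Plt, RelaxedTlsLe /* ... */ };

  struct Relocation {
    RelExpr Expr;    // how to compute the value, decided during the scan
    uint32_t Type;   // original ELF relocation type
    uint64_t Offset; // where in the section to apply it
    int64_t Addend;
    const void *Sym; // the target symbol (opaque here)
  };

  // The scan pass classifies each relocation once and appends a Relocation;
  // the apply pass just walks the vector, computes each value from Expr and
  // writes it, without re-deciding PLT/GOT/relaxation questions.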
llvm-svn: 266158