Can I avoid cache consistency checks by declaring variables as thread-local?
I'm reading about how CPUs maintain cache consistency in multithreaded applications. A write in one core's cache labels that line dirty, and other cores must be careful not to read that segment from main memory, because the main-memory copy is not up to date.
Many of the applications I write work like an actor system, with mutability limited to local variables on a single thread. I don't label them "thread local" unless I have a semantic reason for doing so.
However, am I missing an optimization opportunity? Could explicitly labeling a variable as thread-local, as opposed to merely using it that way, inform the hardware that it doesn't have to check consistency, because the variable is never visible to other threads, at least in principle?
Edit: As a higher-level way of expressing the same thing, should I expect performance gains from using a formal actor system, like Akka, instead of just adhering to the actor paradigm in my own classes? A formal actor system adds strictness, the ability to scale across computers, and presumably some overhead, but does it also handle low-level details such as letting threads skip consistency checks on cached data that is known to be non-shared?
Does labeling data "thread local" help?
As long as you avoid false sharing, you're fine. I.e. make sure static data used by one thread isn't in the same cache line as static data used by another thread. You can check for this with the machine_clears.memory_ordering perf event.
If you find that your program has false sharing, you can re-arrange the order of declarations (since compilers tend to store things in the order they're declared), or use linker sections to choose how things are grouped in static storage. A struct or array gives you guarantees about memory layout.
TL;DR: avoid putting two variables in the same cache line (often 64B) if those variables are used by different threads. Of course, do group together things that are modified at the same time by the same thread.
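To make the layout advice concrete, here is a minimal C++ sketch (mine, not part of the original answer), assuming 64-byte cache lines; the struct name and the 4-thread loop are made up for illustration. alignas(64) gives each thread's counter its own cache line, so one thread's writes never invalidate a line another thread is using.

    #include <atomic>
    #include <thread>
    #include <vector>

    // alignas(64): sizeof(PerThreadCounter) is padded to a full cache line,
    // so adjacent array elements never share a line (no false sharing).
    struct alignas(64) PerThreadCounter {
        std::atomic<long> value{0};
    };

    PerThreadCounter counters[4];   // one element per thread

    int main() {
        std::vector<std::thread> threads;
        for (int t = 0; t < 4; ++t)
            threads.emplace_back([t] {
                for (int i = 0; i < 1000000; ++i)
                    counters[t].value.fetch_add(1, std::memory_order_relaxed);
            });
        for (auto& th : threads) th.join();
    }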
Thread-local variables solve a different problem: they let the same function access a different static variable depending on which thread called it. It's an alternative to passing around pointers.
They're still stored in memory like other static / global variables. You can be sure there's no false sharing, but there are cheaper ways to avoid that.
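To show what thread-local storage gives you semantically, here is a small sketch (the names are hypothetical, not from the answer): the same function touches a different static object depending on which thread called it.

    #include <cstdio>
    #include <thread>

    thread_local int call_count = 0;   // one instance per thread, zero-initialized

    void record_call() {
        ++call_count;                  // touches only the calling thread's copy
    }

    int main() {
        std::thread t([] {
            record_call();
            record_call();
            std::printf("worker thread saw %d calls\n", call_count);  // prints 2
        });
        record_call();
        t.join();
        std::printf("main thread saw %d calls\n", call_count);        // prints 1
    }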
The difference between thread-local vars and "normal" globals is in how they're addressed. Instead of accessing them through an absolute address, it's an offset into the thread-local storage block.
On x86, this is done with segment-override prefixes, e.g. mov rax, qword ptr fs:0x28 loads the qword at offset 0x28 inside the thread-local storage block (since each thread's fs segment register is loaded with the base of its own TLS block).
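As a rough illustration of that addressing difference, a two-function C++ sketch follows; the asm in the comments is what GCC typically emits for x86-64 Linux at -O2, so treat it as an assumption about typical codegen rather than guaranteed output.

    int global_counter;                           // ordinary static storage
    thread_local int tls_counter;                 // one copy per thread

    int read_global() { return global_counter; }  // e.g. mov eax, DWORD PTR global_counter[rip]
    int read_tls()    { return tls_counter; }     // e.g. mov eax, DWORD PTR fs:tls_counter@tpoff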
So TLS isn't free. Don't use it if you don't need it. It can still be cheaper than passing pointers around, though.
There's no way to let the hardware skip cache coherency checks, because the hardware doesn't have any notion of TLS. There are only stores and loads to/from memory, with the ordering guarantees provided by the ISA. Since TLS is just a trick for getting the same function to use different addresses for different callers, a software bug in implementing TLS could result in stores to the same address from multiple threads. The hardware doesn't let buggy software break cache coherency that way, since that could potentially break privilege separation.
On weakly-ordered architectures, memory_order_consume is (in theory) a way to arrange inter-thread data dependencies such that only accesses to shared data have to wait for writes from other threads, not accesses to thread-private data.
However, it's hard for compilers to get this safely and reliably right, so they implement mo_consume as the stronger mo_acquire. I wrote a long and rambling answer a while ago with a bunch of links to memory-ordering stuff, and a mention of C++11 memory_order_consume.
It's hard to standardize because different architectures have different rules about which operations carry a dependency. I assume some code-bases have hand-written asm that takes advantage of dependency ordering. AFAIK, hand-written asm is the only way to take advantage of dependency ordering to avoid memory-barrier instructions on a weakly-ordered ISA. (E.g. in a producer-consumer model, or in lockless algorithms that need more than unordered atomic stores.)
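Here is a sketch of the producer/consumer case that last paragraph is talking about (my example, not code from the answer). With current compilers the memory_order_consume load is silently strengthened to memory_order_acquire; the point of dependency ordering is that the final data load already depends on the loaded pointer, so on a weakly-ordered ISA no extra barrier would be needed in principle.

    #include <atomic>

    struct Payload { int data; };

    std::atomic<Payload*> ready{nullptr};

    void producer(Payload* p) {
        p->data = 42;                               // plain store to the payload
        ready.store(p, std::memory_order_release);  // publish the pointer
    }

    int consumer() {
        Payload* p;
        while (!(p = ready.load(std::memory_order_consume)))  // dependency-ordered load
            ;                                                  // spin until published
        return p->data;  // this load depends on p, so in theory no extra barrier is needed
    }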