"Dmitriy V'jukov" wrote in message news: On Sep 29, 4:23 pm, "Chris M. Thomasson" wrote:
> >> Are you using something like the algorithm in TBB? > > >http://publications.csail.mit.edu/lcs/pubs/pdf/MIT-LCS-TM-548.pdf > > > TBB is based on MIT's Cilk work: > >http://www.cilk.com/multicore-blog/bid/5607/Cilk-Wins-Most-Influentia... > > How does the internal low-level impl compare to the following: > > http://www.cs.bgu.ac.il/~hendlerd/papers/dynamic-size-deque.pdf > > I am looking for raw pseudo-code for atomic deque internal impl > details... > AFAICT, this work from SUN would scale better than Clik. Please correct > me > if I am way off base here. It seems like spawning a successor thread has > overheads... Humm. Pleas try to bear with me here; okay? Correct my > ignorance on Clik's work-stealing internal impl... Well, let me pick an > impl > to focus on... Say, DEC Alpha?
AFAIK, in early days Cilk's work-stealing deque used mutex-based pop(). But I remember there was some mentions of non-blocking algorithms in the Cilk's papers, something like "some people point us that it's possible to implement work-stealing deque in completely non- blocking manner". And I don't know whether non-blocking deque was finally incorporated into Cilk.
If mutex is spin-mutex (i.e. there is only 1 atomic RMW per lock/ unlock) and stealing is rare, then mutex-based deque is nearly the same as non-blocking deque with 1 RMW... provided that push() doesn't use mutex. And provided that atomic RMW has the same cost as StoreLoad memory fence (x86).
You mean:
1 atomic RMW and 1 StoreLoad membar for lock 1 atomic RMW and 1 LoadStore membar for unlock
2 atomics, and 2 membars; fairly expensive... Well... You indeed have a good point in the case that stealing is rare...
"Dmitriy V'jukov" <dvyukov@gmail.com> wrote in message
news:4306b243-2d28-4ec6-a275-883db633b9b8@l42g2000hsc.googlegroups.com...
On Sep 29, 4:23 pm, "Chris M. Thomasson" <n...@spam.invalid> wrote:
> >> Are you using something like the algorithm in TBB?
>
> >http://publications.csail.mit.edu/lcs/pubs/pdf/MIT-LCS-TM-548.pdf
>
> > TBB is based on MIT's Cilk work:
> >http://www.cilk.com/multicore-blog/bid/5607/Cilk-Wins-Most-Influentia...
>
> How does the internal low-level impl compare to the following:
>
> http://www.cs.bgu.ac.il/~hendlerd/papers/dynamic-size-deque.pdf
>
> I am looking for raw pseudo-code for atomic deque internal impl
> details...
> AFAICT, this work from SUN would scale better than Clik. Please correct
> me
> if I am way off base here. It seems like spawning a successor thread has
> overheads... Humm. Pleas try to bear with me here; okay? Correct my
> ignorance on Clik's work-stealing internal impl... Well, let me pick an
> impl
> to focus on... Say, DEC Alpha?
AFAIK, in early days Cilk's work-stealing deque used mutex-based
pop(). But I remember there was some mentions of non-blocking
algorithms in the Cilk's papers, something like "some people point us
that it's possible to implement work-stealing deque in completely non-
blocking manner". And I don't know whether non-blocking deque was
finally incorporated into Cilk.
If mutex is spin-mutex (i.e. there is only 1 atomic RMW per lock/
unlock) and stealing is rare, then mutex-based deque is nearly the
same as non-blocking deque with 1 RMW... provided that push() doesn't
use mutex. And provided that atomic RMW has the same cost as StoreLoad
memory fence (x86).
You mean:
1 atomic RMW and 1 StoreLoad membar for lock
1 atomic RMW and 1 LoadStore membar for unlock
2 atomics, and 2 membars; fairly expensive... Well... You indeed have a good
point in the case that stealing is rare...
"Dmitriy V'jukov" wrote in message news: On Sep 29, 4:23 pm, "Chris M. Thomasson" wrote:
> >> Are you using something like the algorithm in TBB? > > >http://publications.csail.mit.edu/lcs/pubs/pdf/MIT-LCS-TM-548.pdf > > > TBB is based on MIT's Cilk work: > >http://www.cilk.com/multicore-blog/bid/5607/Cilk-Wins-Most-Influentia... > > How does the internal low-level impl compare to the following: > > http://www.cs.bgu.ac.il/~hendlerd/papers/dynamic-size-deque.pdf > > I am looking for raw pseudo-code for atomic deque internal impl > details... > AFAICT, this work from SUN would scale better than Clik. Please correct > me > if I am way off base here. It seems like spawning a successor thread has > overheads... Humm. Pleas try to bear with me here; okay? Correct my > ignorance on Clik's work-stealing internal impl... Well, let me pick an > impl > to focus on... Say, DEC Alpha?
AFAIK, in early days Cilk's work-stealing deque used mutex-based pop(). But I remember there was some mentions of non-blocking algorithms in the Cilk's papers, something like "some people point us that it's possible to implement work-stealing deque in completely non- blocking manner". And I don't know whether non-blocking deque was finally incorporated into Cilk.
If mutex is spin-mutex (i.e. there is only 1 atomic RMW per lock/ unlock) and stealing is rare, then mutex-based deque is nearly the same as non-blocking deque with 1 RMW... provided that push() doesn't use mutex. And provided that atomic RMW has the same cost as StoreLoad memory fence (x86).
You mean:
1 atomic RMW and 1 StoreLoad membar for lock 1 atomic RMW and 1 LoadStore membar for unlock
2 atomics, and 2 membars; fairly expensive... Well... You indeed have a good point in the case that stealing is rare...
Dmitriy V'jukov
On Sep 29, 5:02 pm, "Chris M. Thomasson" wrote:
You mean:
1 atomic RMW and 1 StoreLoad membar for lock 1 atomic RMW and 1 LoadStore membar for unlock
2 atomics, and 2 membars; fairly expensive... Well... You indeed have a g ood point in the case that stealing is rare...
No, I mean for spin-mutex: 1 atomic RMW + 1 StoreLoad membar for lock store-release for unlock
Which for x86 means: 1 atomic RMW for lock (StoreLoad is implied, and so basically costless) plain store for unlock (release is implied)
Dmitriy V'jukov
On Sep 29, 5:02 pm, "Chris M. Thomasson" <n...@spam.invalid> wrote:
You mean:
1 atomic RMW and 1 StoreLoad membar for lock
1 atomic RMW and 1 LoadStore membar for unlock
2 atomics, and 2 membars; fairly expensive... Well... You indeed have a g ood
point in the case that stealing is rare...
No, I mean for spin-mutex:
1 atomic RMW + 1 StoreLoad membar for lock
store-release for unlock
Which for x86 means:
1 atomic RMW for lock (StoreLoad is implied, and so basically
costless)
plain store for unlock (release is implied)
1 atomic RMW and 1 StoreLoad membar for lock 1 atomic RMW and 1 LoadStore membar for unlock
2 atomics, and 2 membars; fairly expensive... Well... You indeed have a g ood point in the case that stealing is rare...
No, I mean for spin-mutex: 1 atomic RMW + 1 StoreLoad membar for lock store-release for unlock
Which for x86 means: 1 atomic RMW for lock (StoreLoad is implied, and so basically costless) plain store for unlock (release is implied)
Dmitriy V'jukov
Chris M. Thomasson
"Dmitriy V'jukov" wrote in message news: On Sep 29, 5:02 pm, "Chris M. Thomasson" wrote:
> You mean: > > 1 atomic RMW and 1 StoreLoad membar for lock > 1 atomic RMW and 1 LoadStore membar for unlock > > 2 atomics, and 2 membars; fairly expensive... Well... You indeed have a > good > point in the case that stealing is rare...
No, I mean for spin-mutex: 1 atomic RMW + 1 StoreLoad membar for lock store-release for unlock
Ahhh, I fail see spin-mutex and thing adaptive mutex. Sorry for my confusion.
Which for x86 means: 1 atomic RMW for lock (StoreLoad is implied, and so basically costless)
IMHO, its not costless. Not at all...
plain store for unlock (release is implied)
Indeed.
"Dmitriy V'jukov" <dvyukov@gmail.com> wrote in message
news:dd05d47f-9502-43a0-ac49-844bbe34c6c7@w7g2000hsa.googlegroups.com...
On Sep 29, 5:02 pm, "Chris M. Thomasson" <n...@spam.invalid> wrote:
> You mean:
>
> 1 atomic RMW and 1 StoreLoad membar for lock
> 1 atomic RMW and 1 LoadStore membar for unlock
>
> 2 atomics, and 2 membars; fairly expensive... Well... You indeed have a
> good
> point in the case that stealing is rare...
No, I mean for spin-mutex:
1 atomic RMW + 1 StoreLoad membar for lock
store-release for unlock
Ahhh, I fail see spin-mutex and thing adaptive mutex. Sorry for my
confusion.
Which for x86 means:
1 atomic RMW for lock (StoreLoad is implied, and so basically
costless)
"Dmitriy V'jukov" wrote in message news: On Sep 29, 5:02 pm, "Chris M. Thomasson" wrote:
> You mean: > > 1 atomic RMW and 1 StoreLoad membar for lock > 1 atomic RMW and 1 LoadStore membar for unlock > > 2 atomics, and 2 membars; fairly expensive... Well... You indeed have a > good > point in the case that stealing is rare...
No, I mean for spin-mutex: 1 atomic RMW + 1 StoreLoad membar for lock store-release for unlock
Ahhh, I fail see spin-mutex and thing adaptive mutex. Sorry for my confusion.
Which for x86 means: 1 atomic RMW for lock (StoreLoad is implied, and so basically costless)
IMHO, its not costless. Not at all...
plain store for unlock (release is implied)
Indeed.
Chris M. Thomasson
"Chris M. Thomasson" wrote in message news:7ImEk.572$
"Dmitriy V'jukov" wrote in message news: On Sep 29, 5:02 pm, "Chris M. Thomasson" wrote:
> You mean: > > 1 atomic RMW and 1 StoreLoad membar for lock > 1 atomic RMW and 1 LoadStore membar for unlock > > 2 atomics, and 2 membars; fairly expensive... Well... You indeed have a > good > point in the case that stealing is rare...
No, I mean for spin-mutex: 1 atomic RMW + 1 StoreLoad membar for lock store-release for unlock
Ahhh, I fail see spin-mutex and thing adaptive mutex.
Let me rephrase...
I see spin-mutex and thought adaptive mutex.
Sorry for my confusion.
:^/
Which for x86 means: 1 atomic RMW for lock (StoreLoad is implied, and so basically costless)
IMHO, its not costless. Not at all...
Bus locking slow-path... Cache locking fast-path...
plain store for unlock (release is implied)
Indeed.
Even this operation has excessive costs when compared to an arch that does not have these implied membars... RMO SPARC has its advantages indeed. TSO compared to RMO? Which one is more light weight and able to scale... I personally prefer a fine-grain model over TSO any day...
"Chris M. Thomasson" <no@spam.invalid> wrote in message
news:7ImEk.572$kc.394@newsfe12.iad...
"Dmitriy V'jukov" <dvyukov@gmail.com> wrote in message
news:dd05d47f-9502-43a0-ac49-844bbe34c6c7@w7g2000hsa.googlegroups.com...
On Sep 29, 5:02 pm, "Chris M. Thomasson" <n...@spam.invalid> wrote:
> You mean:
>
> 1 atomic RMW and 1 StoreLoad membar for lock
> 1 atomic RMW and 1 LoadStore membar for unlock
>
> 2 atomics, and 2 membars; fairly expensive... Well... You indeed have a
> good
> point in the case that stealing is rare...
No, I mean for spin-mutex:
1 atomic RMW + 1 StoreLoad membar for lock
store-release for unlock
Ahhh, I fail see spin-mutex and thing adaptive mutex.
Let me rephrase...
I see spin-mutex and thought adaptive mutex.
Sorry for my confusion.
:^/
Which for x86 means:
1 atomic RMW for lock (StoreLoad is implied, and so basically
costless)
IMHO, its not costless. Not at all...
Bus locking slow-path... Cache locking fast-path...
plain store for unlock (release is implied)
Indeed.
Even this operation has excessive costs when compared to an arch that does
not have these implied membars... RMO SPARC has its advantages indeed. TSO
compared to RMO? Which one is more light weight and able to scale... I
personally prefer a fine-grain model over TSO any day...
"Chris M. Thomasson" wrote in message news:7ImEk.572$
"Dmitriy V'jukov" wrote in message news: On Sep 29, 5:02 pm, "Chris M. Thomasson" wrote:
> You mean: > > 1 atomic RMW and 1 StoreLoad membar for lock > 1 atomic RMW and 1 LoadStore membar for unlock > > 2 atomics, and 2 membars; fairly expensive... Well... You indeed have a > good > point in the case that stealing is rare...
No, I mean for spin-mutex: 1 atomic RMW + 1 StoreLoad membar for lock store-release for unlock
Ahhh, I fail see spin-mutex and thing adaptive mutex.
Let me rephrase...
I see spin-mutex and thought adaptive mutex.
Sorry for my confusion.
:^/
Which for x86 means: 1 atomic RMW for lock (StoreLoad is implied, and so basically costless)
IMHO, its not costless. Not at all...
Bus locking slow-path... Cache locking fast-path...
plain store for unlock (release is implied)
Indeed.
Even this operation has excessive costs when compared to an arch that does not have these implied membars... RMO SPARC has its advantages indeed. TSO compared to RMO? Which one is more light weight and able to scale... I personally prefer a fine-grain model over TSO any day...