[Dcmlib] std::map vs std::hash_map

Sean McInerney seanm at nmr.mgh.harvard.edu
Thu Jul 7 02:39:01 CEST 2005


Mathieu Malaterre wrote:
>> I didn't catch exactly what you did with the union.
>>
>> I suppose the uint32_t tagkey is the 'concatenation' of Dicom Group
>> number and Dicom Element number.
>> What is uint16_t tab used for ?
>>
>> In the original version, we did something stupid : transform the tag
>> (e.g 0x0018,0x0050) into a TagKey ( e.g "0018|0050")
>> We can save time removing the transformation tag -> TagKey.
>> But we don't know (maybe Sean  knows) the way the hash function should
>> act.
>> (in the very late 70's -there was no stl ...-  we splitted the 'key'
>> into 4 bytes pieces, Xor'ed the pieces to get a 'random like' int32,
>> and divide it by the number of blocks, to get the target block number,
>> and perform a chain when collision occurs)
>>
>> I think (I'm not sure), nowadays Htable methods that use 'tries' look
>> at the first bits of the number to know whether they have to split the
>> target block or not, right ?
>>
>> The first uint16 of the tag has no more than 100 distinct values (65
>> in the Dicom V3 directory, + a few 'shadow groups'),
>> The second uint16 of the tag has more than 1250 distinct values, the
>> Dicom V3 directory.
>> Maybe we could have something more efficent if we describe the key as
>> 'Element number Group number' instead of 'Group number Element number '.
>> Or maybe my idea is meaningless :-(
> 
> 
> 
> In fact even when properly coding meaning always passing by const
> reference and return reference... There are always cases where string is
> just pain (comparisons in particular). Which is the case in the hash
> table, we need to quickly compare two TagKey. I first thought of simply
> replacing TagKey by uint16_t t[2]. But then when you return this from a
> function it becomes a pointer, so you have to deals with NULL pointer ->
> this is a pain.
> 
> The other solutions I thought of where: std::pair<uint16_t, uint16_t>,
> struct { uint16_t[2] }; But I found the union the easiest/fastest to
> implement because the notion of order is implicit. And on a 32bits
> machine comparing two ints should be extremely fast. Therefore we read
> in two uint16_t, but the hash table (just like most of my local gdcm)
> only 'SEE' one uint32_t...
> 
> Therefore the hash is really immediate (remember the improvement was 23%!).
> 
> The only problem I have to deals with now is the representation of a
> TagKey within a sequence.
> 
> 
> Does this make sense now, JP ?
> 
>> Ten more days to wait for the Bastille Day.
>> Mathieu, don't forget to bring some fire crakers, at Kitware's.
> 
> 
> I bought a few M-80 from another state ;-)
> 
> Mathieu
> 
> 
Wow firecrackers on Bastille Day, I thought you guys just built
barricades or something. ;-)

I've been on a bit of wild goose chase WRT map types and hashing. I even
went so far as to convert my array-based hash map to an STL-compliant
container with all the required iterators, allocators, algorithms, etc.
After a solid day of work, my benchmarks show performance about
equivalent to stdext::hash_map or __gnu_cxx::hash_map. I'm sure most of
the performance loss is due to use of std::string.

Of course, avoiding strings as keys altogether can reap significant
performance gains, but you'll have to get clever with your hash if your
keys are not densely spaced in their range of values. I have not tested
it yet, but Google's sparse hash implementation may be worth a look ...
and you know they care about speed.

If you can hash on integers, I think you'll want the modulus of the tag
number by the number of blocks (as opposed to dividing) ... maybe???

The simplest solution may be using the stock map or hash_map types, but
keyed by 'const char*' rather than std::string. I think that you can
safely get away with this since GDCM seems to build the maps in question
from string literals at compile time. If the map data already exists in
text segment of the executable, not copying the data into heap space
(ala std::string) could be a significant optimization.

This problem still has my interest ... so I'll keep you posted with any
interesting results or ideas.

-Sean



More information about the Dcmlib mailing list