intbitset¶
About¶
Provides an intbitset
data object holding unordered sets of unsigned
integers with ultra fast set operations, implemented via bit vectors
and Python C extension to optimize speed and memory usage.
Emulates the Python built-in set class interface with some additional specific methods such as its own fast dump and load marshalling functions.
intbitset
additionally support the pickle protocol,
the iterator protocol
and can behave like a sequence
type.
Usage¶
Example:
>>> from intbitset import intbitset
>>> x = intbitset([1,2,3])
>>> y = intbitset([3,4,5])
>>> x & y
intbitset([3])
>>> x | y
intbitset([1, 2, 3, 4, 5])
Notes¶
- Uses real bits to optimize memory usage, so may have issues with endianness
if you transport serialized bitsets between various machine architectures.
- Please note that no bigger than
__maxelem__
elements can be added to anintbitset
. - On modern CPUs, vectorial instruction sets (such as MMX/SSE) are exploited
to further optimize speed.
Performance¶
Here is an example of performance gain with respect to traditional set
of
positive integers (example of ipython session):
>>> ## preparation
>>> from intbitset import intbitset
>>> from random import sample
>>> sparse_population1 = sample(range(1000000), 10000)
>>> sparse_population2 = sample(range(1000000), 10000)
>>> dense_population1 = sample(range(1000000), 900000)
>>> dense_population2 = sample(range(1000000), 900000)
>>> sparse_set1 = set(sparse_population1)
>>> sparse_set2 = set(sparse_population2)
>>> sparse_intbitset1 = intbitset(sparse_population1)
>>> sparse_intbitset2 = intbitset(sparse_population2)
>>> dense_set1 = set(dense_population1)
>>> dense_set2 = set(dense_population2)
>>> dense_intbitset1 = intbitset(dense_population1)
>>> dense_intbitset2 = intbitset(dense_population2)
>>> sorted(sparse_population1)[5000:5002]
[500095, 500124]
>>> in_sparse = 500095
>>> not_in_sparse = 500096
>>> sorted(dense_population1)[500000:500002]
[555705, 555707]
>>> in_dense = 555705
>>> not_in_dense = 555706
For sparse sets, intbitset
operations are typically 50 times faster than
set operations.
>>> ## Sparse sets operations
>>> %timeit sparse_set1 & sparse_set2
1000 loops, best of 3: 263 µs per loop
>>> %timeit sparse_intbitset1 & sparse_intbitset2 ## more than 20 times faster
100000 loops, best of 3: 11.6 µs per loop
>>> %timeit sparse_set1 | sparse_set2
1000 loops, best of 3: 891 µs per loop
>>> %timeit sparse_intbitset1 | sparse_intbitset2 ## almost 70 times faster
100000 loops, best of 3: 12.8 µs per loop
>>> %timeit sparse_set1 ^ sparse_set2
1000 loops, best of 3: 1.09 ms per loop
>>> %timeit sparse_intbitset1 ^ sparse_intbitset2 ## more than 80 times faster
100000 loops, best of 3: 12.9 µs per loop
>>> %timeit sparse_set1 - sparse_set2
1000 loops, best of 3: 739 µs per loop
>>> %timeit sparse_intbitset1 - sparse_intbitset2 ## almost 60 times faster
100000 loops, best of 3: 12.5 µs per loop
For dense sets, intbitset
operations are typically 5000 times faster
than set operations:
>>> ## Dense sets operations
>>> %timeit dense_set1 & dense_set2
10 loops, best of 3: 62.1 ms per loop
>>> %timeit dense_intbitset1 & dense_intbitset2 ## more than 5000 times faster
100000 loops, best of 3: 12.3 µs per loop
>>> %timeit dense_set1 | dense_set2
10 loops, best of 3: 84.1 ms per loop
>>> %timeit dense_intbitset1 | dense_intbitset2 ## more than 6000 times faster
100000 loops, best of 3: 12.5 µs per loop
>>> %timeit dense_set1 ^ dense_set2
10 loops, best of 3: 64.2 ms per loop
>>> %timeit dense_intbitset1 ^ dense_intbitset2 ## more than 5000 times faster
100000 loops, best of 3: 12.6 µs per loop
>>> %timeit dense_set1 - dense_set2
10 loops, best of 3: 38.6 ms per loop
>>> timeit dense_intbitset1 - dense_intbitset2 ## more than 3000 times faster
100000 loops, best of 3: 12.8 µs per loop
Membership operations in intbitset
behave in a comparable way than set
objects, albeit with slightly better performance:
>>> ## Membership tests
>>> %timeit in_sparse in sparse_set1
10000000 loops, best of 3: 66.8 ns per loop
>>> %timeit in_sparse in sparse_intbitset1 ## 1.5 times faster
10000000 loops, best of 3: 42.8 ns per loop
>>> %timeit not_in_sparse in sparse_set1
10000000 loops, best of 3: 71.3 ns per loop
>>> %timeit not_in_sparse in sparse_intbitset1 ## 1.6 times faster
10000000 loops, best of 3: 44.7 ns per loop
>>> %timeit in_dense in dense_set1
10000000 loops, best of 3: 61.8 ns per loop
>>> %timeit in_dense in dense_intbitset1 ## 1.3 times faster
10000000 loops, best of 3: 45.3 ns per loop
>>> %timeit not_in_dense in dense_set1
10000000 loops, best of 3: 45.5 ns per loop
>>> %timeit not_in_dense in dense_intbitset1 ## similar speed
10000000 loops, best of 3: 41.4 ns per loop
Serialising can be up to 30 times faster:
>>> ## serialization speed
>>> ## note: internally intbitset compress using zlib so we are
>>> ## going to also compress the equivalent set
>>> from zlib import compress, decompress
>>> from marshal import dumps, loads
>>> %timeit loads(decompress(compress(dumps(sparse_set1))))
100 loops, best of 3: 6.55 ms per loop
>>> %timeit intbitset(sparse_intbitset1.fastdump()) ## 15% faster
100 loops, best of 3: 5.63 ms per loop
>>> %timeit loads(decompress(compress(dumps(dense_set1))))
1 loops, best of 3: 565 ms per loop
>>> %timeit intbitset(dense_intbitset1.fastdump()) ## almost 30 times faster for dense sets
10 loops, best of 3: 20.9 ms per loop
Serialising can lead to 20 times smaller footprint:
>>> len(compress(dumps(sparse_set1)))
29349
>>> len(sparse_intbitset1.fastdump()) ## almost half the space
16166
>>> len(compress(dumps(dense_set1)))
1363026
>>> len(dense_intbitset1.fastdump()) ## 5% of the space for dense set
70332
Reference¶
Defines an intbitset data object to hold unordered sets of unsigned integers with ultra fast set operations, implemented via bit vectors and Python C extension to optimize speed and memory usage.
Emulates the Python built-in set class interface with some additional specific methods such as its own fast dump and load marshalling functions. Uses real bits to optimize memory usage, so may have issues with endianness if you transport serialized bitsets between various machine architectures.
Please note that no bigger than __maxelem__ elements can be added to an intbitset and, if CFG_INTBITSET_ENABLE_SANITY_CHECKS is disabled, you will receive unpredictable results.
Note to developers: If you make modification to this file you have to manually regenerate intbitset.c by running:
$ cython intbitset.pyx
and then commit generated intbitset.c.
-
class
intbitset.
intbitset
¶ Defines an intbitset data object to hold unordered sets of unsigned integers with ultra fast set operations, implemented via bit vectors and Python C extension to optimize speed and memory usage.
Emulates the Python built-in set class interface with some additional specific methods such as its own fast dump and load marshalling functions. Uses real bits to optimize memory usage, so may have issues with endianness if you transport serialized bitsets between various machine architectures.
- The constructor accept the following parameters:
- rhs=0, int preallocate=-1, int trailing_bits=0, bint sanity_checks=CFG_INTBITSET_ENABLE_SANITY_CHECKS, int no_allocate=0:
where rhs can be:
int/long
for creating allocating empty intbitset that will hold at least rhs elements, before being resizedintbitset
for cloningbytes
for retrieving an intbitset that was dumped into a byte stringarray
for retrieving an intbitset that was dumped into a string stored in an array- sequence made of integers for copying all the elements from the sequence. If minsize is specified than it is initially allocated enough space to hold up to minsize integers, otherwise the biggest element of the sequence will be used.
- sequence made of tuples: then the first element of each tuple is considered as an integer (as in the sequence made of integers).
The other arguments are:
preallocate
is a suggested initial upper bound on the numbers that will be stored, by looking at rhs a sequence of number.trailing_bits
is 1, then the set will contain “all” the positive integersno_allocate
andsanity_checks
are used internally and should never be set.
-
__and__
¶ Return the intersection of two intbitsets as a new set. (i.e. all elements that are in both intbitsets.)
-
__cmp__
¶
-
__contains__
¶ x.__contains__(y) <==> y in x
-
__deepcopy__
()¶
-
__delitem__
¶ x.__delitem__(y) <==> del x[y]
-
__eq__
¶ x.__eq__(y) <==> x==y
-
__ge__
¶ x.__ge__(y) <==> x>=y
-
__getitem__
¶ x.__getitem__(y) <==> x[y]
-
__gt__
¶ x.__gt__(y) <==> x>y
-
__hash__
¶
-
__iadd__
¶ x.__iadd__(y) <==> x+=y
-
__iand__
¶ Update a intbitset with the intersection of itself and another.
-
__ior__
¶ Update a intbitset with the union of itself and another.
-
__isub__
¶ Remove all elements of another set from this set.
-
__iter__
¶
-
__ixor__
¶ Update an intbitset with the symmetric difference of itself and another.
-
__le__
¶ x.__le__(y) <==> x<=y
-
__len__
¶
-
__lt__
¶ x.__lt__(y) <==> x<y
-
__ne__
¶ x.__ne__(y) <==> x!=y
-
__new__
(S, ...) → a new object with type S, a subtype of T¶
-
__nonzero__
¶ x.__nonzero__() <==> x != 0
-
__or__
¶ Return the union of two intbitsets as a new set. (i.e. all elements that are in either intbitsets.)
-
__pyx_vtable__
= <capsule object NULL>¶
-
__rand__
¶ x.__rand__(y) <==> y&x
-
__reduce__
()¶ helper for pickle
-
__repr__
¶
-
__ror__
¶ x.__ror__(y) <==> y|x
-
__rsub__
¶ x.__rsub__(y) <==> y-x
-
__rxor__
¶ x.__rxor__(y) <==> y^x
-
__safe_for_unpickling__
= True¶
-
__setitem__
¶ x.__setitem__(i, y) <==> x[i]=y
-
__str__
¶
-
__sub__
¶ Return the difference of two intbitsets as a new set. (i.e. all elements that are in this intbitset but not the other.)
-
__xor__
¶ Return the symmetric difference of two sets as a new set. (i.e. all elements that are in exactly one of the sets.)
-
add
()¶ Add an element to a set. This has no effect if the element is already present.
-
clear
()¶
-
copy
()¶ Return a shallow copy of a set.
-
difference
()¶ Return a new intbitset with elements from the intbitset that are not in the others.
-
difference_update
()¶ Update the intbitset, removing elements found in others.
-
discard
()¶ Remove an element from a intbitset if it is a member. If the element is not a member, do nothing.
-
extract_finite_list
()¶ Return a finite list of elements sufficient to be passed to intbitset constructor toghether with the proper value of trailing_bits in order to reproduce this intbitset. At least up_to integer are looked for when they are inside the intbitset but not necessarily needed to build the intbitset
-
fastdump
()¶ Return a compressed string representation suitable to be saved somewhere.
-
fastload
()¶ Load a compressed string representation produced by a previous call to the fastdump method into the current intbitset. The previous content will be replaced.
-
get_allocated
()¶
-
get_size
()¶
-
get_wordbitsize
()¶
-
get_wordbytsize
()¶
-
intersection
()¶ Return a new intbitset with elements common to the intbitset and all others.
-
intersection_update
()¶ Update the intbitset, keeping only elements found in it and all others.
-
is_infinite
()¶ Return True if the intbitset is infinite. (i.e. trailing_bits=True was used in the constructor.)
-
isdisjoint
()¶ Return True if two intbitsets have a null intersection.
-
issubset
()¶ Report whether another set contains this set.
-
issuperset
()¶ Report whether this set contains another set.
-
pop
()¶ Remove and return an arbitrary set element.
- Note: intbitset implementation of .pop() differs from the native
set
- implementation by guaranteeing returning always the largest element.
- Note: intbitset implementation of .pop() differs from the native
-
remove
()¶ Remove an element from a set; it must be a member. If the element is not a member, raise a KeyError.
-
strbits
()¶ Return a string of 0s and 1s representing the content in memory of the intbitset.
-
symmetric_difference
¶ Return the symmetric difference of two sets as a new set. (i.e. all elements that are in exactly one of the sets.)
-
symmetric_difference_update
¶ Update an intbitset with the symmetric difference of itself and another.
-
tolist
()¶ Legacy method to retrieve a list of all the elements inside an intbitset.
-
union
()¶ Return a new intbitset with elements from the intbitset and all others.
-
union_update
()¶ Update the intbitset, adding elements from all others.
-
update
()¶ Update the intbitset, adding elements from all others.
-
update_with_signs
()¶ Given a dictionary rhs whose keys are integers, remove all the integers whose value are less than 0 and add every integer whose value is 0 or more
Indices and tables¶
Additional Notes¶
Notes on how to contribute, legal information and changelog are here for the interested.