Thursday, April 28, 2016

Exploring sizes of data types in Python

By Vasudev Ram

I was doing some experiments in Python to see how much of various data types could fit into the memory of my machine. For example, I created successively larger lists of integers (ints) to see at what point Python ran out of memory.

At one point, I got a MemoryError while trying to create a list of ints that I thought should fit into memory. Sample code:
>>> lis = range(10 ** 9)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
MemoryError
After thinking a bit, I realized that the error was to be expected, since data types in dynamic languages such as Python tend to take more space than they do in static languages such as C, due to metadata, pre-allocation (for some types) and interpreter book-keeping overhead.
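A rough back-of-the-envelope estimate makes the failure plausible. The sketch below is only an approximation under stated assumptions: each element is assumed to cost one pointer slot in the list plus one separate int object, and small-int caching and allocator overhead are ignored. The helper name estimate_int_list_bytes is made up for this illustration.

```python
import struct
import sys

def estimate_int_list_bytes(n):
    """Roughly estimate the bytes needed for a list of n distinct ints."""
    pointer_size = struct.calcsize("P")   # bytes per list slot on this build
    int_size = sys.getsizeof(10 ** 6)     # size of one ordinary int object
    list_overhead = sys.getsizeof([])     # fixed cost of the empty list
    return list_overhead + n * (pointer_size + int_size)

estimate = estimate_int_list_bytes(10 ** 9)
print("Approximate bytes for a list of 10 ** 9 ints: {}".format(estimate))
print("That is about {:.1f} GiB".format(estimate / float(2 ** 30)))
```

With 12 bytes per int (the figure in the output further down) and assuming 4-byte pointers, as on a 32-bit build, that is roughly 16 bytes per element, or about 16 GB for 10 ** 9 elements.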

And I remembered the sys.getsizeof() function, which returns the size of its argument in bytes (for the object itself, not for any objects it refers to). So I wrote this code to display the types and sizes of some commonly used types in Python:
from __future__ import print_function
import sys

# data_type_sizes_w_list_comp.py
# A program to show the sizes, in bytes, of values of various
# Python data types.

# Author: Vasudev Ram
# Copyright 2016 Vasudev Ram - https://vasudevram.github.io

#class Foo:
class Foo(object):
    pass

def gen_func():
    yield 1

def setup_data():
    a_bool = bool(0)
    an_int = 0
    a_long = long(0)
    a_float = float(0)
    a_complex = complex(0, 0)
    a_str = ''
    a_tuple = ()
    a_list = []
    a_dict = {}
    a_set = set()
    an_iterator = iter([1, 2, 3])
    a_function = gen_func
    a_generator = gen_func()
    an_instance = Foo()

    data = (a_bool, an_int, a_long, a_float, a_complex,
        a_str, a_tuple, a_list, a_dict, a_set,
        an_iterator, a_function, a_generator, an_instance)
    return data

data = setup_data()

print("\nPython data type sizes:\n")

header = "{} {} {}".format(\
    "Data".center(10), "Type".center(15), "Length".center(10))
print(header)
print('-' * 40)

rows = [ "{} {} {}".format(\
    repr(item).center(10), str(type(item)).center(15), \
    str(sys.getsizeof(item)).center(10)) for item in data[:-4] ]
print('\n'.join(rows))
print('-' * 70)

rows = [ "{} {} {}".format(\
    repr(item).center(10), str(type(item)).center(15), \
    str(sys.getsizeof(item)).center(10)) for item in data[-4:] ]
print('\n'.join(rows))
print('-' * 70)
(I broke out the last 4 objects above into a separate section/table, since the output for them is wider than for the ones above them.)
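An alternative to splitting the table, sketched here as a possibility rather than as what the program above does, is to compute each column's width from its longest entry. The function name format_size_table and the "Size" header are my own choices for this illustration.

```python
import sys

def format_size_table(items):
    """Return table lines with column widths fitted to the longest cell."""
    cells = [(repr(item), str(type(item)), str(sys.getsizeof(item)))
             for item in items]
    headers = ("Data", "Type", "Size")
    # Each column is as wide as its longest entry, header included.
    widths = [max(len(row[i]) for row in cells + [headers])
              for i in range(3)]

    def fmt(row):
        return " ".join(cell.center(width)
                        for cell, width in zip(row, widths))

    rule = "-" * (sum(widths) + 2)   # two single-space column separators
    return [fmt(headers), rule] + [fmt(row) for row in cells] + [rule]

print("\n".join(format_size_table([False, 0, 0.0, "", iter([1, 2, 3])])))
```

Because every cell is padded to its column's width, short and long rows line up in a single table.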

Although iterators, functions, generators and instances (of classes) are not traditionally considered as data types, I included them as well, since they are all objects (see: almost everything in Python is an object), so they are data in a sense too, at least in the sense that programs can manipulate them. And while one is not likely to create tens of thousands or more of objects of these types (except maybe class instances [1]), it's interesting to have an idea of how much space instances of them take in memory.

[1] As an aside, if you have to create thousands of class instances, the flyweight design pattern might be of help.
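As a minimal sketch of the flyweight idea (one possible way to do it, with a hypothetical Glyph class that is not part of the program above), a class can hand out one shared instance per distinct value instead of creating a new object each time:

```python
class Glyph(object):
    """Hypothetical flyweight: one shared instance per distinct character."""
    _pool = {}

    def __new__(cls, char):
        # Reuse the existing instance for this character, if there is one.
        instance = cls._pool.get(char)
        if instance is None:
            instance = super(Glyph, cls).__new__(cls)
            instance.char = char
            cls._pool[char] = instance
        return instance

glyphs = [Glyph(c) for c in "abracadabra"]
print(len(glyphs))        # 11 references in the list
print(len(Glyph._pool))   # 5 distinct objects behind them
```

This only helps when many instances would carry identical, immutable state; the shared objects must not be modified per use.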

Here is the output of running the program with:
$ python data_type_sizes_w_list_comp.py

Python data type sizes:
----------------------------------------
   Data          Type        Length  
----------------------------------------
  False     <type 'bool'>      12    
    0        <type 'int'>      12    
    0L      <type 'long'>      12    
   0.0      <type 'float'>     16    
    0j     <type 'complex'>     24    
    ''       <type 'str'>      21    
    ()      <type 'tuple'>     28    
    []      <type 'list'>      36    
    {}      <type 'dict'>     140    
 set([])     <type 'set'>     116    
----------------------------------------------------------------------

----------------------------------------------------------------------
<listiterator object at 0x021F0FF0> <type 'listiterator'>     32    
<function gen_func at 0x021EBF30> <type 'function'>     60    
<generator object gen_func at 0x021F6C60> <type 'generator'>     40    
<__main__.Foo object at 0x022E6290> <class '__main__.Foo'>     32
----------------------------------------------------------------------

[ When I used the old-style Python class definition for Foo (see the comment near the class keyword in the code), the output for an_instance was this instead:
<__main__.Foo instance at 0x021F6C88> <type 'instance'> 36
So old-style class instances actually take 36 bytes vs. new-style ones taking 32.
]
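One caveat about the instance figures (a point also raised in the comments): sys.getsizeof() is shallow, so it does not count the per-instance attribute dictionary that an ordinary instance refers to. The sketch below uses two hypothetical classes, Plain and Slotted, to show one way of accounting for that; the numbers it prints vary by Python version and platform.

```python
import sys

class Plain(object):
    def __init__(self, x, y):
        self.x = x
        self.y = y

class Slotted(object):
    __slots__ = ("x", "y")   # no per-instance __dict__ is created

    def __init__(self, x, y):
        self.x = x
        self.y = y

plain = Plain(1, 2)
slotted = Slotted(1, 2)

# getsizeof() is shallow, so add the attribute dict separately.
plain_total = sys.getsizeof(plain) + sys.getsizeof(plain.__dict__)
print("Plain instance alone: {}".format(sys.getsizeof(plain)))
print("Plain instance + __dict__: {}".format(plain_total))
print("Slotted instance: {}".format(sys.getsizeof(slotted)))
print("Slotted has __dict__: {}".format(hasattr(slotted, "__dict__")))
```

Even this total is a lower bound, since the attribute values themselves are separate objects.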

We can draw a few deductions from the above output.

- bool is a subclass of int, so a bool takes the same space as an int: 12 bytes.
- float takes a bit more space than a zero-valued long: 16 bytes vs. 12 (a long grows with the magnitude of its value).
- complex takes even more: 24 bytes.
- Strings and the data types below them in the first table above have a fair amount of overhead, even when empty.
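These deductions can be checked interactively. The snippet below is a small sketch, not part of the original program; the exact byte counts depend on the Python version and platform, so none are shown here.

```python
import sys

# bool is a subclass of int, so True and False are int objects.
print(issubclass(bool, int))   # True

# Strings have a fixed overhead plus a cost that grows with length.
for text in ("", "a", "a" * 10, "a" * 100):
    print("{:>3} chars -> {} bytes".format(len(text), sys.getsizeof(text)))

# Integers are variable-length too: larger values need more bytes.
for value in (1, 2 ** 30, 2 ** 100):
    print("{} -> {} bytes".format(value, sys.getsizeof(value)))
```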

Finally, I first wrote the program with two for loops, then changed (and slightly shortened) it by using the two list comprehensions that you see above - hence the file name data_type_sizes_w_list_comp.py :)

- Enjoy.

- Vasudev Ram - Online Python training and consulting


7 comments:

Vasudev Ram said...

Just to be clear:

>- strings and the data types below it in the first table above, have a fair amount of overhead.

ALL of the types shown have some overhead (as compared to the actual space required for their data alone). I just meant that the strings and the later ones have some more. But this is for reasons already mentioned in the post.

Also, the formatting - .center(), etc. - for the second table/section does not work, because the strings are longer than the widths given. Realized this later.

Vasudev Ram said...

Correction:

After updating the program to use list comps instead of for loops, the line that runs the program should now read:

$ python data_type_sizes_w_list_comp.py

Wan said...

I didn't receive the same result as you. I'm using Python 3.5.1 and I ran your code via IDLE. The line "a_long = long(0)" does not work for me. In fact, it produces an error. So I modified that line of code, set variable a_long equal to a 58 digit decimal number, and here is the resultant output.

Python data type sizes:

Data Type Length
----------------------------------------
False 12
0 12
123456789012345678901234567890123456789012345678901234567 38
0.0 16
0j 24
'' 25
() 28
[] 36
{} 148
set() 116
----------------------------------------------------------------------
32
72
48
<__main__.Foo object at 0x02D3FB50> 32
----------------------------------------------------------------------

My data types are classes, and the size of a specific class can be different. For example if you create two integer variables a=0 and b=1234567890, the sizes of both integer classes will be different. I believe these are the size of the objects themselves (not the size of the class), and that's why the sizes between the two integers can be different.

Vasudev Ram said...

@Wan: Thanks for your comment. I used Python 2.7 (.8 or .11). I guess some differences due to Python version are to be expected. About different classes or instances having different sizes, I'll have to check, but it is likely so, similar to struct or record types in any other language. This point did cross my mind while writing the post, but I did not mention it. Good observation.

Vasudev Ram said...

@Wan: And about the long(0) error, Python 3 does not have longs. There is only int, which is like the long of Python 2. See:

https://docs.python.org/3.0/whatsnew/3.0.html

in the section about PEP 0237, and see PEP 237 itself:

https://www.python.org/dev/peps/pep-0237/

Matthew Phipps said...

Great post! I hadn't heard of the flyweight pattern; it sounds useful.

I do feel that the size you listed for user-defined class instances is a bit misleading. Unless you use __slots__, each instance gets its own dict at __dict__, which as you showed, adds at least 140 bytes. Of course, technically that's not part of the object, but fact remains it increases the memory requirements by maybe 5x.

And it's worth noting that most built-in classes like object, int, str, do not have __dict__, for this reason.

Vasudev Ram said...

@Matthew Phipps:

>Great post! I hadn't heard of the flyweight pattern; it sounds useful.

Thank you!

>I do feel that the size you listed for user-defined class instances is a bit misleading. Unless you use __slots__, each instance gets its own dict at __dict__, which as you showed, adds at least 140 bytes. Of course, technically that's not part of the object, but fact remains it increases the memory requirements by maybe 5x.

Thanks for pointing that out. I didn't intend it to be misleading. I did know that instances have a __dict__, but did not remember to include it in the size calculation.

I would say you're right, it should be considered part of the object.

>And it's worth noting that most built-in classes like object, int, str, do not have __dict__, for this reason.

Good point ...