Friday, May 1, 2015

Can a Python data structure reference itself?

By Vasudev Ram




As part of some Python work, I was tinkering with the built-in globals() function of Python, when I noticed this:
>>> g = globals()
>>> g
{'a': 'A', 'A': 'C', 'C': 4, 'b': 'a', 'g': {...}, '__builtins__': <
module '__builtin__' (built-in)>, 'k': '__doc__', '__package__': None, '__name__
': '__main__', '__doc__': None}
>>> # The variable g seems to have got stored inside of itself!
>>> id(g)
31298152L
>>> id(globals())
31298152L
>>> g['g'] == g
True
>>> g['g']['g'] == g
True
>>> g['g']['g']['g'] == g
True
>>> id(g['g']['g']['g']) == id(g)
True
The above does make a kind of sense, when you remember that the globals() function returns a dict representing the current global symbol table, i.e. all the currently defined attributes (objects) in the global scope.

So when I said g = globals(), the variable g got created in the global scope; so when I next print g, it should contain g itself, i.e. g (a list) contains itself as a list item (of g).

But on further thought, it seems like this should depend on the order of the evaluation of the statement g = globals():

Case 1) If the variable g is created first (and hence the global symbol table now contains it), and if only after that is the globals() function called, and its return value assigned to g, then things should work as shown in the above interpreter output.

Case 2) But if the evaluation works in a different order, i.e. the globals() function is first called (before the variable g is created), then at this point its return value (the dict) should not contain the item with key 'g' (and value g), and it is this dict that should get assigned to the variable g. Hence when we print g, we should not see g again within it.

Considering the above output, it seems like Case 1 is what actually happens. Also, I realized that if the globals() function is returning a copy of the dict that represents the global state, the above output should not work. But it seems to be returning the original dict itself, as shown by this:
>>> id(g) == id(globals())
True
To explore this further, I thought of trying to create a similar self-referential structure, but without using globals():
lis = []
lis.append(1)
print 'lis:', lis
print 'id(lis):', id(lis)
lis.append(lis)
print 'lis:', lis
print 'id(lis):', id(lis)
print 'id(lis[1]):', id(lis[1])
print lis == lis[1]
print lis[1] == lis[1][1]
And the output again seems to indicate that the list lis now references itself, that is, contains itself as an item.

I then thought of testing my Python function to flatten an arbitrarily nested list, with the above definition of lis, as the input argument. As expected, it terminated with a run-time error about recursion limit exceeded. See the post linked in the previous sentence for the flatten function - and some variations on it, in the comments on that post.

I'll write more about this in a following post, after fixing the flatten function to handle this case. On first thought, the fix seems straightforward: at any level in the recursion, check if id(lis) == id(lis[i]) - for any i, and if so terminate with an error about this being a self-referential list - but I'll write and test it first, then blog the results.

The more interesting question is whether this is a known Python feature, or an undocumented one (seems unlikely), or a bug (unlikely), and also whether similar behavior exists in other languages. Interested to hear what others think / know about this.

Update:

I looked in the official Python Language Reference, at the Section 3. Data model:

https://docs.python.org/2/reference/datamodel.html

and saw this excerpt:

[ CPython implementation detail: CPython currently uses a reference-counting scheme with (optional) delayed detection of cyclically linked garbage, which collects most objects as soon as they become unreachable, but is not guaranteed to collect garbage containing circular references. ]

Not sure whether it is relevant to the topic at hand, since, on the one hand, it uses the words "cyclically linked", but on the other, it says "garbage collection".

The image at the top of the post is of an ouroboros, which is "ancient symbol depicting a serpent or dragon eating its own tail", according to Wikipedia. Sort of appropriate, I guess, because Python :)

- Vasudev Ram - Online Python and Linux training and programming

Dancing Bison Enterprises

Signup to hear about new products or services that I create.

Posts about Python  Posts about xtopdf

Contact Page

8 comments:

Anonymous said...

A few comments: it's a perfectly okay feature. If you're constructing an object graph rather than a tree, you'll end up with self-references in one way or the other.

Note that pickle can deal with circular references (it uses an internal memoization table to track objects). It can serialize a graph and restore it.

It's not enough to check id(x)==id(x[i]) -- the circular reference can occur on any level.

Vasudev Ram said...

Thanks for your comments. Useful.

Regarding:

>It's not enough to check id(x)==id(x[i]) -- the circular reference can occur on any level.

You're right about that. I did not word this part of my post very clearly:
----------------------------------
On first thought, the fix seems straightforward: at any level in the recursion, check if id(lis) == id(lis[i]) - for any i, and if so terminate with an error
----------------------------------
By it, I meant that id(lis) will have to be checked for equality against all list elements that lis contains, not just for one value of i, and also, which I guess you implied, not just for a list being equal to any element it directly contains, such as lis == lis[i], but also check for possible equality over multiple levels of indirection, such as lis == lis[i]j[k].

Which could potentially make it a computationally expensive process, I guess, even a combinatorial explosion, depending on the structure of the list.

Vasudev Ram said...

>lis == lis[i]j[k]

Sorry, that should have been:

lis == lis[i][j][k]

Anonymous said...

Python passes complex data around by reference. So, when you do


g = globals()


g gets a reference to the global dictionary, not a copy.

>>> g = globals()
>>> 'g' in g
True

>>> 'h' in g
False

>>> h = 123
>>> 'h' in g
True

In [8]:

Vasudev Ram said...

I checked what you wrote. Correct. Good and short explanation. Thanks!

Vasudev Ram said...

I put a link to this post on the Usenet newsgroup comp.lang.python (via Google Groups):

Inner workings of this Python feature: Can a Python data structure reference itself?

There are some more interesting replies in that thread, which delve deeper into the topic.

Thanks to Terry Reedy, who pointed out (on comp.lang.python) a mistake in my post:

"g (a list) contains itself as a list item (of g)"

The correction: g is a dict, not a list, and contains itself as a dict item.

And he also made another few points in the same comment, which you can read on the thread.

Veky said...

I must say I don't get it... what's so strange here? This is a trivial consequence of Python's containers storing their elements' identities, not values.

>>> import types
>>> class Person(types.SimpleNamespace):
def __init__(self, name): self.name, self.friends = name, []
>>> vas, veky = Person('Vasudev'), Person('Veky')
>>> vas.friends.append(veky)
>>> veky.friends.append(vas)

There, Veky and Vasudev became friends. Is there _any_ reason why this should be forbidden?

Vasudev Ram said...

@Veky: The previous commenters have already explained the reasons why this works the way it does. But thanks for your example. Maybe my originally being surprised about it was because it may work differently in some other languages (which I need to check) - don't remember now, since this post was done a while ago.