Baron

Baron is a Full Syntax Tree (FST) for Python. It represents source code as a structured tree, easily parsable by a computer. By opposition to an Abstract Syntax Tree (AST) which drops syntax information in the process of its creation (like empty lines, comments, formatting), a FST keeps everything and guarantees the operation fst_to_code(code_to_fst(source_code)) == source_code.

Installation

pip install baron

Github (source, bug tracker, etc.)

https://github.com/psycojoker/baron

RedBaron

There is a good chance that you’ll want to use RedBaron instead of using Baron directly. Think of Baron as the “bytecode of python source code” and RedBaron as some sort of usable layer on top of it, a bit like dom/jQuery or html/Beautifulsoup.

Basic Usage

Baron provides two main functions:

  • parse to transform a string into Baron’s FST;
  • dumps to transform the FST back into a string.
In [1]: from baron import parse, dumps

In [2]: source_code = "a = 1"

In [3]: fst = parse(source_code)

In [4]: generated_source_code = dumps(fst)

In [5]: generated_source_code
Out[5]: 'a = 1'

In [6]: source_code == generated_source_code
Out[6]: True

Like said in the introduction, the FST keeps the formatting unlike ASTs. Here the following 3 codes are equivalent but their formatting is different. Baron keeps the difference so when dumping back the FST, all the formatting is respected:

In [7]: dumps(parse("a = 1"))
Out[7]: 'a = 1'

In [8]: dumps(parse("a=1"))
Out[8]: 'a=1'

In [9]: dumps(parse( "a   =   1"))
Out[9]: 'a   =   1'

Helpers

Baron also provides 3 helper functions show, show_file and show_node to explore the FST (in iPython for example). Those functions will print a formatted version of the FST so you can play with it to explore the FST and have an idea of what you are playing with.

Show

show is used directly on a string:

In [10]: from baron.helpers import show

In [11]: show("a = 1")
[
    {
        "first_formatting": [
            {
                "type": "space", 
                "value": " "
            }
        ], 
        "target": {
            "type": "name", 
            "value": "a"
        }, 
        "value": {
            "section": "number", 
            "type": "int", 
            "value": "1"
        }, 
        "second_formatting": [
            {
                "type": "space", 
                "value": " "
            }
        ], 
        "operator": null, 
        "type": "assignment"
    }
]

In [12]: show("a +=  b")
[
    {
        "first_formatting": [
            {
                "type": "space", 
                "value": " "
            }
        ], 
        "target": {
            "type": "name", 
            "value": "a"
        }, 
        "value": {
            "type": "name", 
            "value": "b"
        }, 
        "second_formatting": [
            {
                "type": "space", 
                "value": "  "
            }
        ], 
        "operator": "+", 
        "type": "assignment"
    }
]

Show_file

show_file is used on a file path:

from baron.helpers import show_file

show_file("/path/to/a/file")

Show_node

show_node is used on an already parsed string:

In [13]: from baron.helpers import show_node

In [14]: fst = parse("a = 1")

In [15]: show_node(fst)
[
    {
        "first_formatting": [
            {
                "type": "space", 
                "value": " "
            }
        ], 
        "target": {
            "type": "name", 
            "value": "a"
        }, 
        "value": {
            "section": "number", 
            "type": "int", 
            "value": "1"
        }, 
        "second_formatting": [
            {
                "type": "space", 
                "value": " "
            }
        ], 
        "operator": null, 
        "type": "assignment"
    }
]

Under the hood, the FST is serialized into JSON so the helpers are simply encapsulating JSON pretty printers.

Rendering the FST

Baron renders the FST back into source code by following the instructions given by the nodes_rendering_order dictionary. It gives, for everynode FST node, the order in which the node must be rendered.

In [16]: from baron import nodes_rendering_order

In [17]: from baron.helpers import show_node

In [18]: nodes_rendering_order["name"]
Out[18]: [('key', 'value', True)]

In [19]: show_node(parse("a_name")[0])
{
    "type": "name", 
    "value": "a_name"
}

In [20]: nodes_rendering_order["tuple"]
Out[20]: 
[('formatting', 'first_formatting', 'with_parenthesis'),
 ('constant', '(', 'with_parenthesis'),
 ('formatting', 'second_formatting', 'with_parenthesis'),
 ('list', 'value', True),
 ('formatting', 'third_formatting', 'with_parenthesis'),
 ('constant', ')', 'with_parenthesis'),
 ('formatting', 'fourth_formatting', 'with_parenthesis')]

In [21]: show_node(parse("(a_name,another_name,yet_another_name)")[0])
{
    "first_formatting": [], 
    "fourth_formatting": [], 
    "value": [
        {
            "type": "name", 
            "value": "a_name"
        }, 
        {
            "first_formatting": [], 
            "type": "comma", 
            "second_formatting": []
        }, 
        {
            "type": "name", 
            "value": "another_name"
        }, 
        {
            "first_formatting": [], 
            "type": "comma", 
            "second_formatting": []
        }, 
        {
            "type": "name", 
            "value": "yet_another_name"
        }
    ], 
    "second_formatting": [], 
    "third_formatting": [], 
    "with_parenthesis": true, 
    "type": "tuple"
}

In [22]: nodes_rendering_order["comma"]
Out[22]: 
[('formatting', 'first_formatting', True),
 ('constant', ',', True),
 ('formatting', 'second_formatting', True)]

For a “name” node, it is a list containing an unique tuple but it can contain multiple ones like for a “tuple” node.

To render a node, you just need to render each element of the list one by one in the given order. As you can see, they are all formatted as a 3-tuple. The first column is the type which is one of the following:

In [23]: from baron.render import node_types

In [24]: node_types
Out[24]: {'constant', 'formatting', 'key', 'list', 'node'}

Apart for the “constant” node, the second column contains the key of the FST node which must be rendered. The first column explains how that key must be rendered. We’ll see the third column later.

  • A node node is one of the nodes in the nodes_rendering_order we just introduced, it is rendered by following the rules mentionned here. This is indeed a recursive definition.
  • A key node is either a branch of the tree if the corresponding FST node’s key contains another node or a leaf if it contains a string. In the former case, it is rendered by rendering its content. In the latter, the string is outputted directly.
  • A list node is like a key node but can contain 0, 1 or several other nodes. For example, Baron root node is a list node since a python program is a list of statements. It is rendered by rendering each of its elements in order.
  • A formatting node is similar in behaviour to a list node but contains only formatting nodes. This is basically where Baron distinguish itself from ASTs.
  • A constant node is a leaf of the FST tree. The second column always contains a string which is outputted directly. Compared to a key node containing a string, the constant node is identical for every instance of the nodes (e.g. the left parenthesis character ( in a function call node of the def keyword of a function definition) while the key node’s value can change (e.g. the name of the function in a function call node).

Walktrough

Let’s see all this is in action by rendering a “lambda” node. First, the root node is always a “list” node and since we are only parsing one statement, the root node contains our “lambda” node at index 0:

In [25]: fst = parse("lambda x, y = 1: x + y")

In [26]: fst[0]["type"]
Out[26]: 'lambda'

For reference, you can find the (long) FST produced by the lambda node at the end of this section.

Now, let’s see how to render a “lambda” node:

In [27]: nodes_rendering_order["lambda"]
Out[27]: 
[('constant', 'lambda', True),
 ('formatting', 'first_formatting', True),
 ('list', 'arguments', True),
 ('formatting', 'second_formatting', True),
 ('constant', ':', True),
 ('formatting', 'third_formatting', True),
 ('key', 'value', True)]

Okay, first the string constant “lambda”, then a first_formatting node which represents the space between the string “lambda” and the variable “x”.

In [28]: fst[0]["first_formatting"]
Out[28]: [{'type': 'space', 'value': ' '}]

The “first_formatting” contains a list whose unique element is a “space” node.

In [29]: fst[0]["first_formatting"][0]
Out[29]: {'type': 'space', 'value': ' '}

In [30]: nodes_rendering_order["space"]
Out[30]: [('key', 'value', True)]

Which in turn is rendered by looking at the value key of the space node. It’s a string so it is outputted directly.

In [31]: fst[0]["first_formatting"][0]["value"]
Out[31]: ' '

So far we have outputted “lambda ”. Tedious but exhaustive.

We have exhausted the “first_formatting” node so we go back up the tree. Next is the “list” node representing the arguments:

In [32]: fst[0]["arguments"]
Out[32]: 
[{'first_formatting': [],
  'name': 'x',
  'second_formatting': [],
  'type': 'def_argument',
  'value': {}},
 {'first_formatting': [],
  'second_formatting': [{'type': 'space', 'value': ' '}],
  'type': 'comma'},
 {'first_formatting': [{'type': 'space', 'value': ' '}],
  'name': 'y',
  'second_formatting': [{'type': 'space', 'value': ' '}],
  'type': 'def_argument',
  'value': {'section': 'number', 'type': 'int', 'value': '1'}}]

Rendering a “list” node is done one element at a time. First a “def_argument”, then a “comma” and again a “def_argument”.

In [33]: fst[0]["arguments"][0]
Out[33]: 
{'first_formatting': [],
 'name': 'x',
 'second_formatting': [],
 'type': 'def_argument',
 'value': {}}

In [34]: nodes_rendering_order["def_argument"]
Out[34]: 
[('key', 'name', True),
 ('formatting', 'first_formatting', 'value'),
 ('constant', '=', 'value'),
 ('formatting', 'second_formatting', 'value'),
 ('key', 'value', 'value')]

The first “def_argument” is rendered by first outputting the content of a name “key” node, which is string and thus outputted directly:

In [35]: fst[0]["arguments"][0]["name"]
Out[35]: 'x'

Now, we have outputted “lambda x”. At first glance we could say we should render the second element of the “def_argument” node but as we’ll see in the next section, it is not the case thanks to the third column of the tuple.

For reference, the FST of the lambda node:

In [36]: show_node(fst[0])
{
    "first_formatting": [
        {
            "type": "space", 
            "value": " "
        }
    ], 
    "value": {
        "first_formatting": [
            {
                "type": "space", 
                "value": " "
            }
        ], 
        "value": "+", 
        "second_formatting": [
            {
                "type": "space", 
                "value": " "
            }
        ], 
        "second": {
            "type": "name", 
            "value": "y"
        }, 
        "type": "binary_operator", 
        "first": {
            "type": "name", 
            "value": "x"
        }
    }, 
    "second_formatting": [], 
    "third_formatting": [
        {
            "type": "space", 
            "value": " "
        }
    ], 
    "arguments": [
        {
            "first_formatting": [], 
            "type": "def_argument", 
            "name": "x", 
            "value": {}, 
            "second_formatting": []
        }, 
        {
            "first_formatting": [], 
            "type": "comma", 
            "second_formatting": [
                {
                    "type": "space", 
                    "value": " "
                }
            ]
        }, 
        {
            "first_formatting": [
                {
                    "type": "space", 
                    "value": " "
                }
            ], 
            "type": "def_argument", 
            "name": "y", 
            "value": {
                "section": "number", 
                "type": "int", 
                "value": "1"
            }, 
            "second_formatting": [
                {
                    "type": "space", 
                    "value": " "
                }
            ]
        }
    ], 
    "type": "lambda"
}

Dependent rendering

Sometimes, some node elements must not be outputted. In our “def_argument” example, all but the first are conditional. They are only rendered if the FST’s “value” node exists and is not empty. Let’s compare the two “def_arguments” FST nodes:

In [37]: fst[0]["arguments"][0]
Out[37]: 
{'first_formatting': [],
 'name': 'x',
 'second_formatting': [],
 'type': 'def_argument',
 'value': {}}

In [38]: fst[0]["arguments"][2]
Out[38]: 
{'first_formatting': [{'type': 'space', 'value': ' '}],
 'name': 'y',
 'second_formatting': [{'type': 'space', 'value': ' '}],
 'type': 'def_argument',
 'value': {'section': 'number', 'type': 'int', 'value': '1'}}

In [39]: nodes_rendering_order[fst[0]["arguments"][2]["type"]]
Out[39]: 
[('key', 'name', True),
 ('formatting', 'first_formatting', 'value'),
 ('constant', '=', 'value'),
 ('formatting', 'second_formatting', 'value'),
 ('key', 'value', 'value')]

The “value” is empty for the former “def_argument” but not for the latter because only the latter has a default assignment “= 1”.

In [40]: dumps(fst[0]["arguments"][0])
Out[40]: 'x'

In [41]: dumps(fst[0]["arguments"][2])
Out[41]: 'y = 1'

We will conclude here now that we have seen an example of every aspect of FST rendering. Understanding everything is not required to use Baron since dumps handles all the complexity.

Locate a Node

Since Baron produces a tree, a path is sufficient to locate univocally a node in the tree. A common task where a path is involved is when translating a position in a file (a line and a column) into a node of the FST.

Baron provides 2 helper functions for that: position_to_node and position_to_path. Both functions take a FST tree as first argument, then the line number and the column number. Line and column numbers start at 1, like in a text editor.

position_to_node returns an FST node. This is okay if you only want to know which node it is but not enough to locate the node in the tree. Indeed, there can be mutiple identical nodes within the tree.

That’s where position_to_path is useful. It returns a dictionary in JSON format which contains 3 values:

  • the path key contains the path: a list of int and strings which represent either the key to take in a Node or the index in a ListNode (e.g. “target”, “value”, 0)
  • the type key tells the type of the FST node (e.g. “function”, “assignment”, “class”)
  • the position_in_rendering_list key is the rendering position of the node compared to its parent node. This is especially needed when the character pointed on is actually not a node itself but only a part of a parent node. It’s a little complicated but don’t worry, examples will follow.

Let’s first see the difference between the two functions:

In [42]: from baron import parse

In [43]: from baron.finder import position_to_node, position_to_path
---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-43-2ee5cf0b0d42> in <module>()
----> 1 from baron.finder import position_to_node, position_to_path

ImportError: No module named finder

In [44]: some_code = """\
   ....: from baron import parse
   ....: from baron.helpers import show_node
   ....: fst = parse("a = 1")
   ....: show_node(fst)
   ....: """
   ....: 

In [45]: tree = parse(some_code)

In [46]: node = position_to_node(tree, 3, 8)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-46-6fcd4b70f12a> in <module>()
----> 1 node = position_to_node(tree, 3, 8)

NameError: name 'position_to_node' is not defined

In [47]: show_node(node)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-47-478762d152dd> in <module>()
----> 1 show_node(node)

NameError: name 'node' is not defined

In [48]: path = position_to_path(tree, 3, 8)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-48-45ebd8ca0401> in <module>()
----> 1 path = position_to_path(tree, 3, 8)

NameError: name 'position_to_path' is not defined

In [49]: path
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-49-e7fa32cb05ba> in <module>()
----> 1 path

NameError: name 'path' is not defined

The first one gives the node and the second one the node path. Both also give its type but what does the keys in the path correspond to exactly? The path tells you that to get to the node, you must take the 4th index of the root ListNode, followed twice by the “value” key of first the “assignment” Node and next the “atomtrailers” Node. Finally, take the 0th index in the resulting ListNode:

In [50]: show_node(tree[4]["value"]["value"][0])
{
    "type": "name", 
    "value": "parse"
}

Neat. This is so common that there is a function to do that:

In [51]: from baron.finder import path_to_node
---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-51-7499ebbbb11a> in <module>()
----> 1 from baron.finder import path_to_node

ImportError: No module named finder

In [52]: show_node(path_to_node(tree, path))
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-52-d1f4f0a57623> in <module>()
----> 1 show_node(path_to_node(tree, path))

NameError: name 'path_to_node' is not defined

With the two above, that’s a total of three functions to locate a node.

And what about the position_in_rendering_list? To understand, the best is an example. What happens if you try to locate the node corresponding to the left parenthesis on line 3?

In [53]: position_to_path(tree, 3, 12)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-53-d1124224f778> in <module>()
----> 1 position_to_path(tree, 3, 12)

NameError: name 'position_to_path' is not defined

In [54]: show_node(tree[4]["value"]["value"][1])
{
    "first_formatting": [], 
    "fourth_formatting": [], 
    "value": [
        {
            "first_formatting": [], 
            "type": "call_argument", 
            "name": {}, 
            "value": {
                "first_formatting": [], 
                "type": "string", 
                "value": "\"a = 1\"", 
                "second_formatting": []
            }, 
            "second_formatting": []
        }
    ], 
    "second_formatting": [], 
    "third_formatting": [], 
    "type": "call"
}

As you can see, the information given by the path is that I’m on a call node. No parenthesis in sight. That’s where the position_in_rendering_list proves useful. It tells you where you are located in the rendering dictionary:

In [55]: from baron import nodes_rendering_order

In [56]: nodes_rendering_order["call"]
Out[56]: 
[('formatting', 'first_formatting', True),
 ('constant', '(', True),
 ('formatting', 'second_formatting', True),
 ('key', 'value', True),
 ('formatting', 'third_formatting', True),
 ('constant', ')', True),
 ('formatting', 'fourth_formatting', True)]

In [57]: nodes_rendering_order["call"][1]
Out[57]: ('constant', '(', True)

Because the parenthesis is a constant, there is no specific node for the parenthesis. So the path can only go as far as the parent node, here “call”, and show you the position in the rendering dictionary.

For example, it allows you to distinguish the left and right parenthesis in a call.

In [58]: position_to_path(tree, 3, 20)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-58-4aae88753679> in <module>()
----> 1 position_to_path(tree, 3, 20)

NameError: name 'position_to_path' is not defined

In [59]: nodes_rendering_order["call"][5]
Out[59]: ('constant', ')', True)

To conclude this section, let’s look at a last example of path:

In [60]: from baron.finder import position_to_path
---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-60-e5e6945dbeee> in <module>()
----> 1 from baron.finder import position_to_path

ImportError: No module named finder

In [61]: fst = parse("a(1)")

In [62]: position_to_path(fst, 1, 1)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-62-a63380ebd3d9> in <module>()
----> 1 position_to_path(fst, 1, 1)

NameError: name 'position_to_path' is not defined

In [63]: position_to_path(fst, 1, 2)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-63-dac281e2e154> in <module>()
----> 1 position_to_path(fst, 1, 2)

NameError: name 'position_to_path' is not defined

In [64]: position_to_path(fst, 1, 3)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-64-89318ee1c241> in <module>()
----> 1 position_to_path(fst, 1, 3)

NameError: name 'position_to_path' is not defined

In [65]: position_to_path(fst, 1, 4)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-65-1802b6b7fbd2> in <module>()
----> 1 position_to_path(fst, 1, 4)

NameError: name 'position_to_path' is not defined

By the way, out of bound positions are handled gracefully:

In [66]: print(position_to_node(fst, -1, 1))
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-66-4de3017656f3> in <module>()
----> 1 print(position_to_node(fst, -1, 1))

NameError: name 'position_to_node' is not defined

In [67]: print(position_to_node(fst, 1, 0))
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-67-853480996a4b> in <module>()
----> 1 print(position_to_node(fst, 1, 0))

NameError: name 'position_to_node' is not defined

In [68]: print(position_to_node(fst, 1, 5))
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-68-b11a65655afb> in <module>()
----> 1 print(position_to_node(fst, 1, 5))

NameError: name 'position_to_node' is not defined

In [69]: print(position_to_node(fst, 2, 4))
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-69-df6dcdea554f> in <module>()
----> 1 print(position_to_node(fst, 2, 4))

NameError: name 'position_to_node' is not defined

RenderWalker

Internally, Baron uses a walker to traverse a FST tree, it’s a generic class that you are free to use. To do so, you inherit from it and overload the chosen methods. You then launch an instance using it’s walk method. Here is how the Dumper (called by the function dumps) is written using it:

In [70]: from baron.render import RenderWalker

In [71]: class Dumper(RenderWalker):
   ....:     """Usage: Dumper().dump(tree)"""
   ....:     def on_leaf(self, constant, pos, key):
   ....:         self.dump += constant
   ....:         return self.CONTINUE
   ....:     def dump(self, tree):
   ....:         self.dump = ''
   ....:         self.walk(tree)
   ....:         return self.dump
   ....: 

The available methods that you can overload are:

  • before_list called before encountering a list of nodes
  • after_list called after encountering a list of nodes
  • before_formatting called before encountering a formatting list
  • after_formatting called after encountering a formatting list
  • before_node called before encountering a node
  • after_node called after encountering a node
  • before_key called before encountering a key type entry
  • after_key called after encountering a key type entry
  • on_leaf called when encountering a leaf of the FST (can be a constant (like “def” in a function definition) or an actual value like the value a name node)

Every method has the same signature: (self, node, render_pos, render_key).