Edit on GitHub

Semantic Diff for SQL

Motivation

Software is constantly changing and evolving, and identifying what has changed and reviewing those changes is an integral part of the development process. SQL code is no exception to this.

Text-based diff tools such as git diff, when applied to a code base, have certain limitations. First, they can only detect insertions and deletions, not movements or updates of individual pieces of code. Second, such tools can only detect changes between lines of text, which is too coarse for something as granular and detailed as source code. Additionally, the outcome of such a diff is dependent on the underlying code formatting, and yields different results if the formatting should change.

Consider the following diff generated by Git:

Git diff output

Semantically the query hasn’t changed. The two arguments b and c have been swapped (moved), posing no impact on the output of the query. Yet Git replaced the whole affected expression alongside a bulk of unrelated elements.

The alternative to text-based diffing is to compare Abstract Syntax Trees (AST) instead. The main advantage of ASTs are that they are a direct product of code parsing, which represents the underlying code structure at any desired level of granularity. Comparing ASTs may yield extremely precise diffs; changes such as code movements and updates can also be detected. Even more importantly, this approach facilitates additional use cases beyond eyeballing two versions of source code side by side.

The use cases I had in mind for SQL when I decided to embark on this journey of semantic diffing were the following:

Query similarity score. Identifying which parts the two queries have in common to automatically suggest opportunities for consolidation, creation of intermediate/staging tables, and so on.
Differentiating between cosmetic / structural changes and functional ones. For example when a nested query is refactored into a common table expression (CTE), this kind of change doesn’t have any functional impact on either a query or its outcome.
Automatic suggestions about the need to retroactively backfill data. This is especially important for pipelines that populate very large tables for which restatement is a runtime-intensive procedure. The ability to discern between simple code movements and actual modifications can help assess the impact of a change and make suggestions accordingly.

The implementation discussed in this post is now a part of the SQLGlot library. You can find a complete source code in the diff.py module. The choice of SQLglot was an obvious one due to its simple but powerful API, lack of external dependencies and, more importantly, extensive list of supported SQL dialects.

The Search for a Solution

When it comes to any diffing tool (not just a semantic one), the primary challenge is to match as many elements of compared entities as possible. Once such a set of matching elements is available, deriving a sequence of changes becomes an easy task.

If our elements have unique identifiers associated with them (for example, an element’s ID in DOM), the matching problem is trivial. However, the SQL syntax trees that we are comparing have neither unique keys nor object identifiers that can be used for the purposes of matching. So, how do we suppose to find pairs of nodes that are related?

To better illustrate the problem, consider comparing the following SQL expressions: SELECT a + b + c, d, e and SELECT a - b + c, e, f. Matching individual nodes from respective syntax trees can be visualized as follows:

Figure 1: Example of node matching for two SQL expression trees.

By looking at the figure of node matching for two SQL expression trees above, we conclude that the following changes should be captured by our solution:

Inserted nodes: Sub and f. These are the nodes from the target AST which do not have a matching node in the source AST.
Removed nodes: Add and d. These are the nodes from the source AST which do not have a counterpart in the target AST.
Remaining nodes must be identified as unchanged.

It should be clear at this point that if we manage to match nodes in the source tree with their counterparts in the target tree, then computing the diff becomes a trivial matter.

Naïve Brute-Force

The naïve solution would be to try all different permutations of node pair combinations, and see which set of pairs performs the best based on some type of heuristics. The runtime cost of such a solution quickly reaches the escape velocity; if both trees had only 10 nodes each, the number of such sets would approximately be 10! ^ 2 = 3.6M ^ 2 ~= 13 * 10^12. This is a very bad case of factorial complexity (to be precise, it’s actually much worse - O(n! ^ 2) - but I couldn’t come up with a name for it), so there is little need to explore this approach any further.

Myers Algorithm

After the naïve approach was proven to be infeasible, the next question I asked myself was “how does git diff work?”. This question led me to discover the Myers diff algorithm [1]. This algorithm has been designed to compare sequences of strings. At its core, it’s looking for the shortest path on a graph of possible edits that transform the first sequence into the second one, while heavily rewarding those paths that lead to longest subsequences of unchanged elements. There’s a lot of material out there describing this algorithm in greater detail. I found James Coglan’s series of blog posts to be the most comprehensive.

Therefore, I had this “brilliant” (actually not) idea to transform trees into sequences by traversing them in topological order, and then applying the Myers algorithm on resulting sequences while using a custom heuristics when checking the equality of two nodes. Unsurprisingly, comparing sequences of strings is quite different from comparing hierarchical tree structures, and by flattening trees into sequences, we lose a lot of relevant context. This resulted in a terrible performance of this algorithm on ASTs. It often matched completely unrelated nodes, even when the two trees were mostly the same, and produced extremely inaccurate lists of changes overall. After playing around with it a little and tweaking my equality heuristics to improve accuracy, I ultimately scrapped the whole implementation and went back to the drawing board.

Change Distiller

The algorithm I settled on at the end was Change Distiller, created by Fluri et al. [2], which in turn is an improvement over the core idea described by Chawathe et al. [3].

The algorithm consists of two high-level steps:

Finding appropriate matchings between pairs of nodes that are part of compared ASTs. Identifying what is meant by “appropriate” matching is also a part of this step.
Generating the so-called “edit script” from the matching set built in the 1st step. The edit script is a sequence of edit operations (for example, insert, remove, update, etc.) on individual tree nodes, such that when applied as transformations on the source AST, it eventually becomes the target AST. In general, the shorter the sequence, the better. The length of the edit script can be used to compare the performance of different algorithms, though this is not the only metric that matters.

The rest of this section is dedicated to the Python implementation of the steps above using the AST implementation provided by the SQLGlot library.

Building the Matching Set

Matching Leaves

We begin composing the matching set by matching the leaf nodes. Leaf nodes are the nodes that do not have any children nodes (such as literals, identifiers, etc.). In order to match them, we gather all the leaf nodes from the source tree and generate a cartesian product with all the leaves from the target tree, while comparing pairs created this way and assigning them a similarity score. During this stage, we also exclude pairs that don’t pass basic matching criteria. Then, we pick pairs that scored the highest while making sure that each node is matched no more than once.

Using the example provided at the beginning of the post, the process of building an initial set of candidate matchings can be seen on Figure 2.

Figure 2: Building a set of candidate matchings between leaf nodes. The third item in each triplet represents a similarity score between two nodes.

First, let’s analyze the similarity score. Then, we’ll discuss matching criteria.

The similarity score proposed by Fluri et al. [2] is a dice coefficient applied to bigrams of respective node values. A bigram is a sequence of two adjacent elements from a string computed in a sliding window fashion:

def bigram(string):
    count = max(0, len(string) - 1)
    return [string[i : i + 2] for i in range(count)]

For reasons that will become clear shortly, we actually need to compute bigram histograms rather than just sequences:

from collections import defaultdict

def bigram_histo(string):
    count = max(0, len(string) - 1)
    bigram_histo = defaultdict(int)
    for i in range(count):
        bigram_histo[string[i : i + 2]] += 1
    return bigram_histo

The dice coefficient formula looks like following:

Dice Coefficient

Where X is a bigram of the source node and Y is a bigram of the second one. What this essentially does is count the number of bigram elements the two nodes have in common, multiply it by 2, and then divide by the total number of elements in both bigrams. This is where bigram histograms come in handy:

def dice_coefficient(source, target):
    source_histo = bigram_histo(source.sql())
    target_histo = bigram_histo(target.sql())

    total_grams = (
        sum(source_histo.values()) + sum(target_histo.values())
    )
    if not total_grams:
        return 1.0 if source == target else 0.0

    overlap_len = 0
    overlapping_grams = set(source_histo) & set(target_histo)
    for g in overlapping_grams:
        overlap_len += min(source_histo[g], target_histo[g])

    return 2 * overlap_len / total_grams

To compute a bigram given a tree node, we first transform the node into its canonical SQL representation,so that the Literal(123) node becomes just “123” and the Identifier(“a”) node becomes just “a”. We also handle a scenario when strings are too short to derive bigrams. In this case, we fallback to checking the two nodes for equality.

Now when we know how to compute the similarity score, we can take care of the matching criteria for leaf nodes. In the original paper [2], the matching criteria is formalized as follows:

Matching criteria for leaf nodes

The two nodes are matched if two conditions are met:

The node labels match (in our case labels are just node types).
The similarity score for node values is greater than or equal to some threshold “f”. The authors of the paper recommend setting the value of “f” to 0.6.

With building blocks in place, we can now build a matching set for leaf nodes. First, we generate a list of candidates for matching:

from heapq import heappush, heappop

candidate_matchings = []
source_leaves = _get_leaves(self._source)
target_leaves = _get_leaves(self._target)
for source_leaf in source_leaves:
    for target_leaf in target_leaves:
        if _is_same_type(source_leaf, target_leaf):
            similarity_score = dice_coefficient(
                source_leaf, target_leaf
            )
            if similarity_score >= 0.6:
                heappush(
                    candidate_matchings,
                    (
                        -similarity_score,
                        len(candidate_matchings),
                        source_leaf,
                        target_leaf,
                    ),
                )

In the implementation above, we push each matching pair onto the heap to automatically maintain the correct order based on the assigned similarity score.

Finally, we build the initial matching set by picking leaf pairs with the highest score:

matching_set = set()
while candidate_matchings:
    _, _, source_leaf, target_leaf = heappop(candidate_matchings)
    if (
        source_leaf in unmatched_source_nodes
        and target_leaf in unmatched_target_nodes
    ):
        matching_set.add((source_leaf, target_leaf))
        unmatched_source_nodes.remove(source_leaf)
        unmatched_target_nodes.remove(target_leaf)

To finalize the matching set, we should now proceed with matching inner nodes.

Matching Inner Nodes

Matching inner nodes is quite similar to matching leaf nodes, with the following two distinctions:

Rather than ranking a set of possible candidates, we pick the first node pair that passes the matching criteria.
The matching criteria itself has been extended to account for the number of leaf nodes the pair of inner nodes have in common.

Figure 3: Matching inner nodes based on their type as well as how many of their leaf nodes have been previously matched.

Let’s start with the matching criteria. The criteria is formalized as follows:

Matching criteria for inner nodes

Alongside already familiar similarity score and node type criteria, there is a new one in the middle: the ratio of leaf nodes that the two nodes have in common must exceed some threshold “t”. The recommended value for “t” is also 0.6. Counting the number of common leaf nodes is pretty straightforward, since we already have the complete matching set for leaves. All we need to do is count how many matching pairs do leaf nodes from the two compared inner nodes form.

There are two additional heuristics associated with this matching criteria:

Inner node similarity weighting: if the similarity score between the node values doesn’t pass the threshold “f” but the ratio of common leaf nodes (“t”) is greater than or equal to 0.8, then the matching is considered successful.
The threshold “t” is reduced to 0.4 for inner nodes with the number of leaf nodes equal to 4 or less, in order to decrease the false negative rate for small subtrees.

We now only have to iterate through the remaining unmatched nodes and form matching pairs based on the outlined criteria:

leaves_matching_set = matching_set.copy()

for source_node in unmatched_source_nodes.copy():
    for target_node in unmatched_target_nodes:
        if _is_same_type(source_node, target_node):
            source_leaves = set(_get_leaves(source_node))
            target_leaves = set(_get_leaves(target_node))

            max_leaves_num = max(len(source_leaves), len(target_leaves))
            if max_leaves_num:
                common_leaves_num = sum(
                    1 if s in source_leaves and t in target_leaves else 0
                    for s, t in leaves_matching_set
                )
                leaf_similarity_score = common_leaves_num / max_leaves_num
            else:
                leaf_similarity_score = 0.0

            adjusted_t = (
                0.6
                if min(len(source_leaves), len(target_leaves)) > 4
                else 0.4
            )

            if leaf_similarity_score >= 0.8 or (
                leaf_similarity_score >= adjusted_t
                and dice_coefficient(source_node, target_node) >= 0.6
            ):
                matching_set.add((source_node, target_node))
                unmatched_source_nodes.remove(source_node)
                unmatched_target_nodes.remove(target_node)
                break

After the matching set is formed, we can proceed with generation of the edit script, which will be the algorithm’s output.

Generating the Edit Script

At this point, we should have the following 3 sets at our disposal:

The set of matched node pairs.
The set of remaining unmatched nodes from the source tree.
The set of remaining unmatched nodes from the target tree.

We can derive 3 kinds of edits from the matching set: either the node’s value was updated (Update), the node was moved to a different position within the tree (Move), or the node remained unchanged (Keep). Note that the Move case is not mutually exclusive with the other two. The node could have been updated or could have remained the same while at the same time its position within its parent node or the parent node itself could have changed. All unmatched nodes from the source tree are the ones that were removed (Remove), while unmatched nodes from the target tree are the ones that were inserted (Insert).

The latter two cases are pretty straightforward to implement:

edit_script = []

for removed_node in unmatched_source_nodes:
    edit_script.append(Remove(removed_node))
for inserted_node in unmatched_target_nodes:
    edit_script.append(Insert(inserted_node))

Traversing the matching set requires a little more thought:

for source_node, target_node in matching_set:
    if (
        not isinstance(source_node, LEAF_EXPRESSION_TYPES)
        or source_node == target_node
    ):
        move_edits = generate_move_edits(
            source_node, target_node, matching_set
        )
        edit_script.extend(move_edits)
        edit_script.append(Keep(source_node, target_node))
    else:
        edit_script.append(Update(source_node, target_node))

If a matching pair represents a pair of leaf nodes, we check if they are the same to decide whether an update took place. For inner node pairs, we also need to compare the positions of their respective children to detect node movements. Chawathe et al. [3] suggest applying the longest common subsequence (LCS) algorithm which, no surprise here, was described by Myers himself [1]. There is a small catch, however: instead of checking the equality of two children nodes, we need to check whether the two nodes form a pair that is a part of our matching set.

Now with this knowledge, the implementation becomes straightforward:

def generate_move_edits(source, target, matching_set):
    source_children = _get_child_nodes(source)
    target_children = _get_child_nodes(target)

    lcs = set(
        _longest_common_subsequence(
            source_children,
            target_children,
            lambda l, r: (l, r) in matching_set
        )
    )

    move_edits = []
    for node in source_children:
        if node not in lcs and node not in unmatched_source_nodes:
            move_edits.append(Move(node))

    return move_edits

I left out the implementation of the LCS algorithm itself here, but there are plenty of implementation choices out there that can be easily looked up.

Output

The implemented algorithm produces the output that resembles the following:

>>> from sqlglot import parse_one, diff
>>> diff(parse_one("SELECT a + b + c, d, e"), parse_one("SELECT a - b + c, e, f"))

Remove(Add)
Remove(Column(d))
Remove(Identifier(d))
Insert(Sub)
Insert(Column(f))
Insert(Identifier(f))
Keep(Select, Select)
Keep(Add, Add)
Keep(Column(a), Column(a))
Keep(Identifier(a), Identifier(a))
Keep(Column(b), Column(b))
Keep(Identifier(b), Identifier(b))
Keep(Column(c), Column(c))
Keep(Identifier(c), Identifier(c))
Keep(Column(e), Column(e))
Keep(Identifier(e), Identifier(e))

Note that the output above is abbreviated. The string representation of actual AST nodes is significantly more verbose.

The implementation works especially well when coupled with the SQLGlot’s query optimizer which can be used to produce canonical representations of compared queries:

>>> schema={"t": {"a": "INT", "b": "INT", "c": "INT", "d": "INT"}}
>>> source = """
... SELECT 1 + 1 + a
... FROM t
... WHERE b = 1 OR (c = 2 AND d = 3)
... """
>>> target = """
... SELECT 2 + a
... FROM t
... WHERE (b = 1 OR c = 2) AND (b = 1 OR d = 3)
... """
>>> optimized_source = optimize(parse_one(source), schema=schema)
>>> optimized_target = optimize(parse_one(target), schema=schema)
>>> edit_script = diff(optimized_source, optimized_target)
>>> sum(0 if isinstance(e, Keep) else 1 for e in edit_script)
0

Optimizations

The worst case runtime complexity of this algorithm is not exactly stellar: O(n^2 * log n^2). This is because of the leaf matching process, which involves ranking a cartesian product between all leaf nodes of compared trees. Unsurprisingly, the algorithm takes a considerable time to finish for bigger queries.

There are still a few basic things we can do in our implementation to help improve performance:

Refer to individual node objects using their identifiers (Python’s id()) instead of direct references in sets. This helps avoid costly recursive hash calculations and equality checks.
Cache bigram histograms to avoid computing them more than once for the same node.
Compute the canonical SQL string representation for each tree once while caching string representations of all inner nodes. This prevents redundant tree traversals when bigrams are computed.

At the time of writing only the first two optimizations have been implemented, so there is an opportunity to contribute for anyone who’s interested.

Alternative Solutions

This section is dedicated to solutions that I’ve investigated, but haven’t tried.

First, this section wouldn’t be complete without Tristan Hume’s blog post. Tristan’s solution has a lot in common with the Myers algorithm plus heuristics that is much more clever than what I came up with. The implementation relies on a combination of dynamic programming and A* search algorithm to explore the space of possible matchings and pick the best ones. It seemed to have worked well for Tistan’s specific use case, but after my negative experience with the Myers algorithm, I decided to try something different.

Another notable approach is the Gumtree algorithm by Falleri et al. [4]. I discovered this paper after I’d already implemented the algorithm that is the main focus of this post. In sections 5.2 and 5.3 of their paper, the authors compare the two algorithms side by side and claim that Gumtree is significantly better in terms of both runtime performance and accuracy when evaluated on 12 792 pairs of Java source files. This doesn’t surprise me, as the algorithm takes the height of subtrees into account. In my tests, I definitely saw scenarios in which this context would have helped. On top of that, the authors promise O(n^2) runtime complexity in the worst case which, given the Change Distiller's O(n^2 * log n^2), looks particularly tempting. I hope to try this algorithm out at some point, and there is a good chance you see me writing about it in my future posts.

Conclusion

The Change Distiller algorithm yielded quite satisfactory results in most of my tests. The scenarios in which it fell short mostly concerned identical (or very similar) subtrees located in different parts of the AST. In those cases, node mismatches were frequent and, as a result, edit scripts were somewhat suboptimal.

Additionally, the runtime performance of the algorithm leaves a lot to be desired. On trees with 1000 leaf nodes each, the algorithm takes a little under 2 seconds to complete. My implementation still has room for improvement, but this should give you a rough idea of what to expect. It appears that the Gumtree algorithm [4] can help address both of these points. I hope to find bandwidth to work on it soon and then compare the two algorithms side-by-side to find out which one performs better on SQL specifically. In the meantime, Change Distiller definitely gets the job done, and I can now proceed with applying it to some of the use cases I mentioned at the beginning of this post.

I’m also curious to learn whether other folks in the industry faced a similar problem, and how they approached it. If you did something similar, I’m interested to hear about your experience.

References

[1] Eugene W. Myers. An O(ND) Difference Algorithm and Its Variations. Algorithmica 1(2): 251-266 (1986)

[2] B. Fluri, M. Wursch, M. Pinzger, and H. Gall. Change Distilling: Tree differencing for fine-grained source code change extraction. IEEE Trans. Software Eng., 33(11):725–743, 2007.

[3] S.S. Chawathe, A. Rajaraman, H. Garcia-Molina, and J. Widom. Change Detection in Hierarchically Structured Information. Proc. ACM Sigmod Int’l Conf. Management of Data, pp. 493-504, June 1996

[4] Jean-Rémy Falleri, Floréal Morandat, Xavier Blanc, Matias Martinez, Martin Monperrus. Fine-grained and Accurate Source Code Differencing. Proceedings of the International Conference on Automated Software Engineering, 2014, Västeras, Sweden. pp.313-324, 10.1145/2642937.2642982. hal-01054552

View Source

  1"""
  2.. include:: ../posts/sql_diff.md
  3
  4----
  5"""
  6
  7from __future__ import annotations
  8
  9import typing as t
 10from collections import defaultdict
 11from dataclasses import dataclass
 12from heapq import heappop, heappush
 13from itertools import chain
 14
 15from sqlglot import Dialect, expressions as exp
 16from sqlglot.helper import seq_get
 17
 18if t.TYPE_CHECKING:
 19    from sqlglot.dialects.dialect import DialectType
 20
 21
 22@dataclass(frozen=True)
 23class Insert:
 24    """Indicates that a new node has been inserted"""
 25
 26    expression: exp.Expression
 27
 28
 29@dataclass(frozen=True)
 30class Remove:
 31    """Indicates that an existing node has been removed"""
 32
 33    expression: exp.Expression
 34
 35
 36@dataclass(frozen=True)
 37class Move:
 38    """Indicates that an existing node's position within the tree has changed"""
 39
 40    source: exp.Expression
 41    target: exp.Expression
 42
 43
 44@dataclass(frozen=True)
 45class Update:
 46    """Indicates that an existing node has been updated"""
 47
 48    source: exp.Expression
 49    target: exp.Expression
 50
 51
 52@dataclass(frozen=True)
 53class Keep:
 54    """Indicates that an existing node hasn't been changed"""
 55
 56    source: exp.Expression
 57    target: exp.Expression
 58
 59
 60if t.TYPE_CHECKING:
 61    from sqlglot._typing import T
 62
 63    Edit = t.Union[Insert, Remove, Move, Update, Keep]
 64
 65
 66def diff(
 67    source: exp.Expression,
 68    target: exp.Expression,
 69    matchings: t.List[t.Tuple[exp.Expression, exp.Expression]] | None = None,
 70    delta_only: bool = False,
 71    **kwargs: t.Any,
 72) -> t.List[Edit]:
 73    """
 74    Returns the list of changes between the source and the target expressions.
 75
 76    Examples:
 77        >>> diff(parse_one("a + b"), parse_one("a + c"))
 78        [
 79            Remove(expression=(COLUMN this: (IDENTIFIER this: b, quoted: False))),
 80            Insert(expression=(COLUMN this: (IDENTIFIER this: c, quoted: False))),
 81            Keep(
 82                source=(ADD this: ...),
 83                target=(ADD this: ...)
 84            ),
 85            Keep(
 86                source=(COLUMN this: (IDENTIFIER this: a, quoted: False)),
 87                target=(COLUMN this: (IDENTIFIER this: a, quoted: False))
 88            ),
 89        ]
 90
 91    Args:
 92        source: the source expression.
 93        target: the target expression against which the diff should be calculated.
 94        matchings: the list of pre-matched node pairs which is used to help the algorithm's
 95            heuristics produce better results for subtrees that are known by a caller to be matching.
 96            Note: expression references in this list must refer to the same node objects that are
 97            referenced in the source / target trees.
 98        delta_only: excludes all `Keep` nodes from the diff.
 99        kwargs: additional arguments to pass to the ChangeDistiller instance.
100
101    Returns:
102        the list of Insert, Remove, Move, Update and Keep objects for each node in the source and the
103        target expression trees. This list represents a sequence of steps needed to transform the source
104        expression tree into the target one.
105    """
106    matchings = matchings or []
107
108    def compute_node_mappings(
109        old_nodes: tuple[exp.Expression, ...], new_nodes: tuple[exp.Expression, ...]
110    ) -> t.Dict[int, exp.Expression]:
111        node_mapping = {}
112        for old_node, new_node in zip(reversed(old_nodes), reversed(new_nodes)):
113            new_node._hash = hash(new_node)
114            node_mapping[id(old_node)] = new_node
115
116        return node_mapping
117
118    # if the source and target have any shared objects, that means there's an issue with the ast
119    # the algorithm won't work because the parent / hierarchies will be inaccurate
120    source_nodes = tuple(source.walk())
121    target_nodes = tuple(target.walk())
122    source_ids = {id(n) for n in source_nodes}
123    target_ids = {id(n) for n in target_nodes}
124
125    copy = (
126        len(source_nodes) != len(source_ids)
127        or len(target_nodes) != len(target_ids)
128        or source_ids & target_ids
129    )
130
131    source_copy = source.copy() if copy else source
132    target_copy = target.copy() if copy else target
133
134    try:
135        # We cache the hash of each new node here to speed up equality comparisons. If the input
136        # trees aren't copied, these hashes will be evicted before returning the edit script.
137        if copy and matchings:
138            source_mapping = compute_node_mappings(source_nodes, tuple(source_copy.walk()))
139            target_mapping = compute_node_mappings(target_nodes, tuple(target_copy.walk()))
140            matchings = [(source_mapping[id(s)], target_mapping[id(t)]) for s, t in matchings]
141        else:
142            for node in chain(reversed(source_nodes), reversed(target_nodes)):
143                node._hash = hash(node)
144
145        edit_script = ChangeDistiller(**kwargs).diff(
146            source_copy,
147            target_copy,
148            matchings=matchings,
149            delta_only=delta_only,
150        )
151    finally:
152        if not copy:
153            for node in chain(source_nodes, target_nodes):
154                node._hash = None
155
156    return edit_script
157
158
159# The expression types for which Update edits are allowed.
160UPDATABLE_EXPRESSION_TYPES = (
161    exp.Alias,
162    exp.Boolean,
163    exp.Column,
164    exp.DataType,
165    exp.Lambda,
166    exp.Literal,
167    exp.Table,
168    exp.Window,
169)
170
171IGNORED_LEAF_EXPRESSION_TYPES = (exp.Identifier,)
172
173
174class ChangeDistiller:
175    """
176    The implementation of the Change Distiller algorithm described by Beat Fluri and Martin Pinzger in
177    their paper https://ieeexplore.ieee.org/document/4339230, which in turn is based on the algorithm by
178    Chawathe et al. described in http://ilpubs.stanford.edu:8090/115/1/1995-46.pdf.
179    """
180
181    def __init__(self, f: float = 0.6, t: float = 0.6, dialect: DialectType = None) -> None:
182        self.f = f
183        self.t = t
184        self._sql_generator = Dialect.get_or_raise(dialect).generator()
185
186    def diff(
187        self,
188        source: exp.Expression,
189        target: exp.Expression,
190        matchings: t.List[t.Tuple[exp.Expression, exp.Expression]] | None = None,
191        delta_only: bool = False,
192    ) -> t.List[Edit]:
193        matchings = matchings or []
194        pre_matched_nodes = {id(s): id(t) for s, t in matchings}
195
196        self._source = source
197        self._target = target
198        self._source_index = {
199            id(n): n for n in self._source.bfs() if not isinstance(n, IGNORED_LEAF_EXPRESSION_TYPES)
200        }
201        self._target_index = {
202            id(n): n for n in self._target.bfs() if not isinstance(n, IGNORED_LEAF_EXPRESSION_TYPES)
203        }
204        self._unmatched_source_nodes = set(self._source_index) - set(pre_matched_nodes)
205        self._unmatched_target_nodes = set(self._target_index) - set(pre_matched_nodes.values())
206        self._bigram_histo_cache: t.Dict[int, t.DefaultDict[str, int]] = {}
207
208        matching_set = self._compute_matching_set() | set(pre_matched_nodes.items())
209        return self._generate_edit_script(dict(matching_set), delta_only)
210
211    def _generate_edit_script(self, matchings: t.Dict[int, int], delta_only: bool) -> t.List[Edit]:
212        edit_script: t.List[Edit] = []
213        for removed_node_id in self._unmatched_source_nodes:
214            edit_script.append(Remove(self._source_index[removed_node_id]))
215        for inserted_node_id in self._unmatched_target_nodes:
216            edit_script.append(Insert(self._target_index[inserted_node_id]))
217        for kept_source_node_id, kept_target_node_id in matchings.items():
218            source_node = self._source_index[kept_source_node_id]
219            target_node = self._target_index[kept_target_node_id]
220
221            identical_nodes = source_node == target_node
222
223            if not isinstance(source_node, UPDATABLE_EXPRESSION_TYPES) or identical_nodes:
224                if identical_nodes:
225                    source_parent = source_node.parent
226                    target_parent = target_node.parent
227
228                    if (
229                        (source_parent and not target_parent)
230                        or (not source_parent and target_parent)
231                        or (
232                            source_parent
233                            and target_parent
234                            and matchings.get(id(source_parent)) != id(target_parent)
235                        )
236                    ):
237                        edit_script.append(Move(source=source_node, target=target_node))
238                else:
239                    edit_script.extend(
240                        self._generate_move_edits(source_node, target_node, matchings)
241                    )
242
243                source_non_expression_leaves = dict(_get_non_expression_leaves(source_node))
244                target_non_expression_leaves = dict(_get_non_expression_leaves(target_node))
245
246                if source_non_expression_leaves != target_non_expression_leaves:
247                    edit_script.append(Update(source_node, target_node))
248                elif not delta_only:
249                    edit_script.append(Keep(source_node, target_node))
250            else:
251                edit_script.append(Update(source_node, target_node))
252
253        return edit_script
254
255    def _generate_move_edits(
256        self, source: exp.Expression, target: exp.Expression, matchings: t.Dict[int, int]
257    ) -> t.List[Move]:
258        source_args = [id(e) for e in _expression_only_args(source)]
259        target_args = [id(e) for e in _expression_only_args(target)]
260
261        args_lcs = set(
262            _lcs(source_args, target_args, lambda l, r: matchings.get(t.cast(int, l)) == r)
263        )
264
265        move_edits = []
266        for a in source_args:
267            if a not in args_lcs and a not in self._unmatched_source_nodes:
268                move_edits.append(
269                    Move(source=self._source_index[a], target=self._target_index[matchings[a]])
270                )
271
272        return move_edits
273
274    def _compute_matching_set(self) -> t.Set[t.Tuple[int, int]]:
275        leaves_matching_set = self._compute_leaf_matching_set()
276        matching_set = leaves_matching_set.copy()
277
278        ordered_unmatched_source_nodes = {
279            id(n): None for n in self._source.bfs() if id(n) in self._unmatched_source_nodes
280        }
281        ordered_unmatched_target_nodes = {
282            id(n): None for n in self._target.bfs() if id(n) in self._unmatched_target_nodes
283        }
284
285        for source_node_id in ordered_unmatched_source_nodes:
286            for target_node_id in ordered_unmatched_target_nodes:
287                source_node = self._source_index[source_node_id]
288                target_node = self._target_index[target_node_id]
289                if _is_same_type(source_node, target_node):
290                    source_leaf_ids = {id(l) for l in _get_expression_leaves(source_node)}
291                    target_leaf_ids = {id(l) for l in _get_expression_leaves(target_node)}
292
293                    max_leaves_num = max(len(source_leaf_ids), len(target_leaf_ids))
294                    if max_leaves_num:
295                        common_leaves_num = sum(
296                            1 if s in source_leaf_ids and t in target_leaf_ids else 0
297                            for s, t in leaves_matching_set
298                        )
299                        leaf_similarity_score = common_leaves_num / max_leaves_num
300                    else:
301                        leaf_similarity_score = 0.0
302
303                    adjusted_t = (
304                        self.t if min(len(source_leaf_ids), len(target_leaf_ids)) > 4 else 0.4
305                    )
306
307                    if leaf_similarity_score >= 0.8 or (
308                        leaf_similarity_score >= adjusted_t
309                        and self._dice_coefficient(source_node, target_node) >= self.f
310                    ):
311                        matching_set.add((source_node_id, target_node_id))
312                        self._unmatched_source_nodes.remove(source_node_id)
313                        self._unmatched_target_nodes.remove(target_node_id)
314                        ordered_unmatched_target_nodes.pop(target_node_id, None)
315                        break
316
317        return matching_set
318
319    def _compute_leaf_matching_set(self) -> t.Set[t.Tuple[int, int]]:
320        candidate_matchings: t.List[t.Tuple[float, int, int, exp.Expression, exp.Expression]] = []
321        source_expression_leaves = list(_get_expression_leaves(self._source))
322        target_expression_leaves = list(_get_expression_leaves(self._target))
323        for source_leaf in source_expression_leaves:
324            for target_leaf in target_expression_leaves:
325                if _is_same_type(source_leaf, target_leaf):
326                    similarity_score = self._dice_coefficient(source_leaf, target_leaf)
327                    if similarity_score >= self.f:
328                        heappush(
329                            candidate_matchings,
330                            (
331                                -similarity_score,
332                                -_parent_similarity_score(source_leaf, target_leaf),
333                                len(candidate_matchings),
334                                source_leaf,
335                                target_leaf,
336                            ),
337                        )
338
339        # Pick best matchings based on the highest score
340        matching_set = set()
341        while candidate_matchings:
342            _, _, _, source_leaf, target_leaf = heappop(candidate_matchings)
343            if (
344                id(source_leaf) in self._unmatched_source_nodes
345                and id(target_leaf) in self._unmatched_target_nodes
346            ):
347                matching_set.add((id(source_leaf), id(target_leaf)))
348                self._unmatched_source_nodes.remove(id(source_leaf))
349                self._unmatched_target_nodes.remove(id(target_leaf))
350
351        return matching_set
352
353    def _dice_coefficient(self, source: exp.Expression, target: exp.Expression) -> float:
354        source_histo = self._bigram_histo(source)
355        target_histo = self._bigram_histo(target)
356
357        total_grams = sum(source_histo.values()) + sum(target_histo.values())
358        if not total_grams:
359            return 1.0 if source == target else 0.0
360
361        overlap_len = 0
362        overlapping_grams = set(source_histo) & set(target_histo)
363        for g in overlapping_grams:
364            overlap_len += min(source_histo[g], target_histo[g])
365
366        return 2 * overlap_len / total_grams
367
368    def _bigram_histo(self, expression: exp.Expression) -> t.DefaultDict[str, int]:
369        if id(expression) in self._bigram_histo_cache:
370            return self._bigram_histo_cache[id(expression)]
371
372        expression_str = self._sql_generator.generate(expression)
373        count = max(0, len(expression_str) - 1)
374        bigram_histo: t.DefaultDict[str, int] = defaultdict(int)
375        for i in range(count):
376            bigram_histo[expression_str[i : i + 2]] += 1
377
378        self._bigram_histo_cache[id(expression)] = bigram_histo
379        return bigram_histo
380
381
382def _get_expression_leaves(expression: exp.Expression) -> t.Iterator[exp.Expression]:
383    has_child_exprs = False
384
385    for node in expression.iter_expressions():
386        if not isinstance(node, IGNORED_LEAF_EXPRESSION_TYPES):
387            has_child_exprs = True
388            yield from _get_expression_leaves(node)
389
390    if not has_child_exprs:
391        yield expression
392
393
394def _get_non_expression_leaves(expression: exp.Expression) -> t.Iterator[t.Tuple[str, t.Any]]:
395    for arg, value in expression.args.items():
396        if isinstance(value, exp.Expression) or (
397            isinstance(value, list) and isinstance(seq_get(value, 0), exp.Expression)
398        ):
399            continue
400
401        yield (arg, value)
402
403
404def _is_same_type(source: exp.Expression, target: exp.Expression) -> bool:
405    if type(source) is type(target):
406        if isinstance(source, exp.Join):
407            return source.args.get("side") == target.args.get("side")
408
409        if isinstance(source, exp.Anonymous):
410            return source.this == target.this
411
412        return True
413
414    return False
415
416
417def _parent_similarity_score(
418    source: t.Optional[exp.Expression], target: t.Optional[exp.Expression]
419) -> int:
420    if source is None or target is None or type(source) is not type(target):
421        return 0
422
423    return 1 + _parent_similarity_score(source.parent, target.parent)
424
425
426def _expression_only_args(expression: exp.Expression) -> t.Iterator[exp.Expression]:
427    yield from (
428        arg
429        for arg in expression.iter_expressions()
430        if not isinstance(arg, IGNORED_LEAF_EXPRESSION_TYPES)
431    )
432
433
434def _lcs(
435    seq_a: t.Sequence[T], seq_b: t.Sequence[T], equal: t.Callable[[T, T], bool]
436) -> t.Sequence[t.Optional[T]]:
437    """Calculates the longest common subsequence"""
438
439    len_a = len(seq_a)
440    len_b = len(seq_b)
441    lcs_result = [[None] * (len_b + 1) for i in range(len_a + 1)]
442
443    for i in range(len_a + 1):
444        for j in range(len_b + 1):
445            if i == 0 or j == 0:
446                lcs_result[i][j] = []  # type: ignore
447            elif equal(seq_a[i - 1], seq_b[j - 1]):
448                lcs_result[i][j] = lcs_result[i - 1][j - 1] + [seq_a[i - 1]]  # type: ignore
449            else:
450                lcs_result[i][j] = (
451                    lcs_result[i - 1][j]
452                    if len(lcs_result[i - 1][j]) > len(lcs_result[i][j - 1])  # type: ignore
453                    else lcs_result[i][j - 1]
454                )
455
456    return lcs_result[len_a][len_b]  # type: ignore

@dataclass(frozen=True)

class Insert: View Source

23@dataclass(frozen=True)
24class Insert:
25    """Indicates that a new node has been inserted"""
26
27    expression: exp.Expression

Indicates that a new node has been inserted

Insert(expression: sqlglot.expressions.Expression)

expression: sqlglot.expressions.Expression

@dataclass(frozen=True)

class Remove: View Source

30@dataclass(frozen=True)
31class Remove:
32    """Indicates that an existing node has been removed"""
33
34    expression: exp.Expression

Indicates that an existing node has been removed

Remove(expression: sqlglot.expressions.Expression)

expression: sqlglot.expressions.Expression

@dataclass(frozen=True)

class Move: View Source

37@dataclass(frozen=True)
38class Move:
39    """Indicates that an existing node's position within the tree has changed"""
40
41    source: exp.Expression
42    target: exp.Expression

Indicates that an existing node's position within the tree has changed

Move( source: sqlglot.expressions.Expression, target: sqlglot.expressions.Expression)

source: sqlglot.expressions.Expression

target: sqlglot.expressions.Expression

@dataclass(frozen=True)

class Update: View Source

45@dataclass(frozen=True)
46class Update:
47    """Indicates that an existing node has been updated"""
48
49    source: exp.Expression
50    target: exp.Expression

Indicates that an existing node has been updated

Update( source: sqlglot.expressions.Expression, target: sqlglot.expressions.Expression)

source: sqlglot.expressions.Expression

target: sqlglot.expressions.Expression

@dataclass(frozen=True)

class Keep: View Source

53@dataclass(frozen=True)
54class Keep:
55    """Indicates that an existing node hasn't been changed"""
56
57    source: exp.Expression
58    target: exp.Expression

Indicates that an existing node hasn't been changed

Keep( source: sqlglot.expressions.Expression, target: sqlglot.expressions.Expression)

source: sqlglot.expressions.Expression

target: sqlglot.expressions.Expression

def diff( source: sqlglot.expressions.Expression, target: sqlglot.expressions.Expression, matchings: Optional[List[Tuple[sqlglot.expressions.Expression, sqlglot.expressions.Expression]]] = None, delta_only: bool = False, **kwargs: Any) -> List[Union[Insert, Remove, Move, Update, Keep]]: View Source

 67def diff(
 68    source: exp.Expression,
 69    target: exp.Expression,
 70    matchings: t.List[t.Tuple[exp.Expression, exp.Expression]] | None = None,
 71    delta_only: bool = False,
 72    **kwargs: t.Any,
 73) -> t.List[Edit]:
 74    """
 75    Returns the list of changes between the source and the target expressions.
 76
 77    Examples:
 78        >>> diff(parse_one("a + b"), parse_one("a + c"))
 79        [
 80            Remove(expression=(COLUMN this: (IDENTIFIER this: b, quoted: False))),
 81            Insert(expression=(COLUMN this: (IDENTIFIER this: c, quoted: False))),
 82            Keep(
 83                source=(ADD this: ...),
 84                target=(ADD this: ...)
 85            ),
 86            Keep(
 87                source=(COLUMN this: (IDENTIFIER this: a, quoted: False)),
 88                target=(COLUMN this: (IDENTIFIER this: a, quoted: False))
 89            ),
 90        ]
 91
 92    Args:
 93        source: the source expression.
 94        target: the target expression against which the diff should be calculated.
 95        matchings: the list of pre-matched node pairs which is used to help the algorithm's
 96            heuristics produce better results for subtrees that are known by a caller to be matching.
 97            Note: expression references in this list must refer to the same node objects that are
 98            referenced in the source / target trees.
 99        delta_only: excludes all `Keep` nodes from the diff.
100        kwargs: additional arguments to pass to the ChangeDistiller instance.
101
102    Returns:
103        the list of Insert, Remove, Move, Update and Keep objects for each node in the source and the
104        target expression trees. This list represents a sequence of steps needed to transform the source
105        expression tree into the target one.
106    """
107    matchings = matchings or []
108
109    def compute_node_mappings(
110        old_nodes: tuple[exp.Expression, ...], new_nodes: tuple[exp.Expression, ...]
111    ) -> t.Dict[int, exp.Expression]:
112        node_mapping = {}
113        for old_node, new_node in zip(reversed(old_nodes), reversed(new_nodes)):
114            new_node._hash = hash(new_node)
115            node_mapping[id(old_node)] = new_node
116
117        return node_mapping
118
119    # if the source and target have any shared objects, that means there's an issue with the ast
120    # the algorithm won't work because the parent / hierarchies will be inaccurate
121    source_nodes = tuple(source.walk())
122    target_nodes = tuple(target.walk())
123    source_ids = {id(n) for n in source_nodes}
124    target_ids = {id(n) for n in target_nodes}
125
126    copy = (
127        len(source_nodes) != len(source_ids)
128        or len(target_nodes) != len(target_ids)
129        or source_ids & target_ids
130    )
131
132    source_copy = source.copy() if copy else source
133    target_copy = target.copy() if copy else target
134
135    try:
136        # We cache the hash of each new node here to speed up equality comparisons. If the input
137        # trees aren't copied, these hashes will be evicted before returning the edit script.
138        if copy and matchings:
139            source_mapping = compute_node_mappings(source_nodes, tuple(source_copy.walk()))
140            target_mapping = compute_node_mappings(target_nodes, tuple(target_copy.walk()))
141            matchings = [(source_mapping[id(s)], target_mapping[id(t)]) for s, t in matchings]
142        else:
143            for node in chain(reversed(source_nodes), reversed(target_nodes)):
144                node._hash = hash(node)
145
146        edit_script = ChangeDistiller(**kwargs).diff(
147            source_copy,
148            target_copy,
149            matchings=matchings,
150            delta_only=delta_only,
151        )
152    finally:
153        if not copy:
154            for node in chain(source_nodes, target_nodes):
155                node._hash = None
156
157    return edit_script

Returns the list of changes between the source and the target expressions.

Examples:

>>> diff(parse_one("a + b"), parse_one("a + c"))
[
    Remove(expression=(COLUMN this: (IDENTIFIER this: b, quoted: False))),
    Insert(expression=(COLUMN this: (IDENTIFIER this: c, quoted: False))),
    Keep(
        source=(ADD this: ...),
        target=(ADD this: ...)
    ),
    Keep(
        source=(COLUMN this: (IDENTIFIER this: a, quoted: False)),
        target=(COLUMN this: (IDENTIFIER this: a, quoted: False))
    ),
]

Arguments:

source: the source expression.
target: the target expression against which the diff should be calculated.
matchings: the list of pre-matched node pairs which is used to help the algorithm's heuristics produce better results for subtrees that are known by a caller to be matching. Note: expression references in this list must refer to the same node objects that are referenced in the source / target trees.
delta_only: excludes all Keep nodes from the diff.
kwargs: additional arguments to pass to the ChangeDistiller instance.

Returns:

the list of Insert, Remove, Move, Update and Keep objects for each node in the source and the target expression trees. This list represents a sequence of steps needed to transform the source expression tree into the target one.

UPDATABLE_EXPRESSION_TYPES = (<class 'sqlglot.expressions.Alias'>, <class 'sqlglot.expressions.Boolean'>, <class 'sqlglot.expressions.Column'>, <class 'sqlglot.expressions.DataType'>, <class 'sqlglot.expressions.Lambda'>, <class 'sqlglot.expressions.Literal'>, <class 'sqlglot.expressions.Table'>, <class 'sqlglot.expressions.Window'>)

IGNORED_LEAF_EXPRESSION_TYPES = (<class 'sqlglot.expressions.Identifier'>,)

class ChangeDistiller: View Source

175class ChangeDistiller:
176    """
177    The implementation of the Change Distiller algorithm described by Beat Fluri and Martin Pinzger in
178    their paper https://ieeexplore.ieee.org/document/4339230, which in turn is based on the algorithm by
179    Chawathe et al. described in http://ilpubs.stanford.edu:8090/115/1/1995-46.pdf.
180    """
181
182    def __init__(self, f: float = 0.6, t: float = 0.6, dialect: DialectType = None) -> None:
183        self.f = f
184        self.t = t
185        self._sql_generator = Dialect.get_or_raise(dialect).generator()
186
187    def diff(
188        self,
189        source: exp.Expression,
190        target: exp.Expression,
191        matchings: t.List[t.Tuple[exp.Expression, exp.Expression]] | None = None,
192        delta_only: bool = False,
193    ) -> t.List[Edit]:
194        matchings = matchings or []
195        pre_matched_nodes = {id(s): id(t) for s, t in matchings}
196
197        self._source = source
198        self._target = target
199        self._source_index = {
200            id(n): n for n in self._source.bfs() if not isinstance(n, IGNORED_LEAF_EXPRESSION_TYPES)
201        }
202        self._target_index = {
203            id(n): n for n in self._target.bfs() if not isinstance(n, IGNORED_LEAF_EXPRESSION_TYPES)
204        }
205        self._unmatched_source_nodes = set(self._source_index) - set(pre_matched_nodes)
206        self._unmatched_target_nodes = set(self._target_index) - set(pre_matched_nodes.values())
207        self._bigram_histo_cache: t.Dict[int, t.DefaultDict[str, int]] = {}
208
209        matching_set = self._compute_matching_set() | set(pre_matched_nodes.items())
210        return self._generate_edit_script(dict(matching_set), delta_only)
211
212    def _generate_edit_script(self, matchings: t.Dict[int, int], delta_only: bool) -> t.List[Edit]:
213        edit_script: t.List[Edit] = []
214        for removed_node_id in self._unmatched_source_nodes:
215            edit_script.append(Remove(self._source_index[removed_node_id]))
216        for inserted_node_id in self._unmatched_target_nodes:
217            edit_script.append(Insert(self._target_index[inserted_node_id]))
218        for kept_source_node_id, kept_target_node_id in matchings.items():
219            source_node = self._source_index[kept_source_node_id]
220            target_node = self._target_index[kept_target_node_id]
221
222            identical_nodes = source_node == target_node
223
224            if not isinstance(source_node, UPDATABLE_EXPRESSION_TYPES) or identical_nodes:
225                if identical_nodes:
226                    source_parent = source_node.parent
227                    target_parent = target_node.parent
228
229                    if (
230                        (source_parent and not target_parent)
231                        or (not source_parent and target_parent)
232                        or (
233                            source_parent
234                            and target_parent
235                            and matchings.get(id(source_parent)) != id(target_parent)
236                        )
237                    ):
238                        edit_script.append(Move(source=source_node, target=target_node))
239                else:
240                    edit_script.extend(
241                        self._generate_move_edits(source_node, target_node, matchings)
242                    )
243
244                source_non_expression_leaves = dict(_get_non_expression_leaves(source_node))
245                target_non_expression_leaves = dict(_get_non_expression_leaves(target_node))
246
247                if source_non_expression_leaves != target_non_expression_leaves:
248                    edit_script.append(Update(source_node, target_node))
249                elif not delta_only:
250                    edit_script.append(Keep(source_node, target_node))
251            else:
252                edit_script.append(Update(source_node, target_node))
253
254        return edit_script
255
256    def _generate_move_edits(
257        self, source: exp.Expression, target: exp.Expression, matchings: t.Dict[int, int]
258    ) -> t.List[Move]:
259        source_args = [id(e) for e in _expression_only_args(source)]
260        target_args = [id(e) for e in _expression_only_args(target)]
261
262        args_lcs = set(
263            _lcs(source_args, target_args, lambda l, r: matchings.get(t.cast(int, l)) == r)
264        )
265
266        move_edits = []
267        for a in source_args:
268            if a not in args_lcs and a not in self._unmatched_source_nodes:
269                move_edits.append(
270                    Move(source=self._source_index[a], target=self._target_index[matchings[a]])
271                )
272
273        return move_edits
274
275    def _compute_matching_set(self) -> t.Set[t.Tuple[int, int]]:
276        leaves_matching_set = self._compute_leaf_matching_set()
277        matching_set = leaves_matching_set.copy()
278
279        ordered_unmatched_source_nodes = {
280            id(n): None for n in self._source.bfs() if id(n) in self._unmatched_source_nodes
281        }
282        ordered_unmatched_target_nodes = {
283            id(n): None for n in self._target.bfs() if id(n) in self._unmatched_target_nodes
284        }
285
286        for source_node_id in ordered_unmatched_source_nodes:
287            for target_node_id in ordered_unmatched_target_nodes:
288                source_node = self._source_index[source_node_id]
289                target_node = self._target_index[target_node_id]
290                if _is_same_type(source_node, target_node):
291                    source_leaf_ids = {id(l) for l in _get_expression_leaves(source_node)}
292                    target_leaf_ids = {id(l) for l in _get_expression_leaves(target_node)}
293
294                    max_leaves_num = max(len(source_leaf_ids), len(target_leaf_ids))
295                    if max_leaves_num:
296                        common_leaves_num = sum(
297                            1 if s in source_leaf_ids and t in target_leaf_ids else 0
298                            for s, t in leaves_matching_set
299                        )
300                        leaf_similarity_score = common_leaves_num / max_leaves_num
301                    else:
302                        leaf_similarity_score = 0.0
303
304                    adjusted_t = (
305                        self.t if min(len(source_leaf_ids), len(target_leaf_ids)) > 4 else 0.4
306                    )
307
308                    if leaf_similarity_score >= 0.8 or (
309                        leaf_similarity_score >= adjusted_t
310                        and self._dice_coefficient(source_node, target_node) >= self.f
311                    ):
312                        matching_set.add((source_node_id, target_node_id))
313                        self._unmatched_source_nodes.remove(source_node_id)
314                        self._unmatched_target_nodes.remove(target_node_id)
315                        ordered_unmatched_target_nodes.pop(target_node_id, None)
316                        break
317
318        return matching_set
319
320    def _compute_leaf_matching_set(self) -> t.Set[t.Tuple[int, int]]:
321        candidate_matchings: t.List[t.Tuple[float, int, int, exp.Expression, exp.Expression]] = []
322        source_expression_leaves = list(_get_expression_leaves(self._source))
323        target_expression_leaves = list(_get_expression_leaves(self._target))
324        for source_leaf in source_expression_leaves:
325            for target_leaf in target_expression_leaves:
326                if _is_same_type(source_leaf, target_leaf):
327                    similarity_score = self._dice_coefficient(source_leaf, target_leaf)
328                    if similarity_score >= self.f:
329                        heappush(
330                            candidate_matchings,
331                            (
332                                -similarity_score,
333                                -_parent_similarity_score(source_leaf, target_leaf),
334                                len(candidate_matchings),
335                                source_leaf,
336                                target_leaf,
337                            ),
338                        )
339
340        # Pick best matchings based on the highest score
341        matching_set = set()
342        while candidate_matchings:
343            _, _, _, source_leaf, target_leaf = heappop(candidate_matchings)
344            if (
345                id(source_leaf) in self._unmatched_source_nodes
346                and id(target_leaf) in self._unmatched_target_nodes
347            ):
348                matching_set.add((id(source_leaf), id(target_leaf)))
349                self._unmatched_source_nodes.remove(id(source_leaf))
350                self._unmatched_target_nodes.remove(id(target_leaf))
351
352        return matching_set
353
354    def _dice_coefficient(self, source: exp.Expression, target: exp.Expression) -> float:
355        source_histo = self._bigram_histo(source)
356        target_histo = self._bigram_histo(target)
357
358        total_grams = sum(source_histo.values()) + sum(target_histo.values())
359        if not total_grams:
360            return 1.0 if source == target else 0.0
361
362        overlap_len = 0
363        overlapping_grams = set(source_histo) & set(target_histo)
364        for g in overlapping_grams:
365            overlap_len += min(source_histo[g], target_histo[g])
366
367        return 2 * overlap_len / total_grams
368
369    def _bigram_histo(self, expression: exp.Expression) -> t.DefaultDict[str, int]:
370        if id(expression) in self._bigram_histo_cache:
371            return self._bigram_histo_cache[id(expression)]
372
373        expression_str = self._sql_generator.generate(expression)
374        count = max(0, len(expression_str) - 1)
375        bigram_histo: t.DefaultDict[str, int] = defaultdict(int)
376        for i in range(count):
377            bigram_histo[expression_str[i : i + 2]] += 1
378
379        self._bigram_histo_cache[id(expression)] = bigram_histo
380        return bigram_histo

The implementation of the Change Distiller algorithm described by Beat Fluri and Martin Pinzger in their paper https://ieeexplore.ieee.org/document/4339230, which in turn is based on the algorithm by Chawathe et al. described in http://ilpubs.stanford.edu:8090/115/1/1995-46.pdf.

ChangeDistiller( f: float = 0.6, t: float = 0.6, dialect: Union[str, sqlglot.dialects.Dialect, Type[sqlglot.dialects.Dialect], NoneType] = None) View Source

182    def __init__(self, f: float = 0.6, t: float = 0.6, dialect: DialectType = None) -> None:
183        self.f = f
184        self.t = t
185        self._sql_generator = Dialect.get_or_raise(dialect).generator()

def diff( self, source: sqlglot.expressions.Expression, target: sqlglot.expressions.Expression, matchings: Optional[List[Tuple[sqlglot.expressions.Expression, sqlglot.expressions.Expression]]] = None, delta_only: bool = False) -> List[Union[Insert, Remove, Move, Update, Keep]]: View Source

187    def diff(
188        self,
189        source: exp.Expression,
190        target: exp.Expression,
191        matchings: t.List[t.Tuple[exp.Expression, exp.Expression]] | None = None,
192        delta_only: bool = False,
193    ) -> t.List[Edit]:
194        matchings = matchings or []
195        pre_matched_nodes = {id(s): id(t) for s, t in matchings}
196
197        self._source = source
198        self._target = target
199        self._source_index = {
200            id(n): n for n in self._source.bfs() if not isinstance(n, IGNORED_LEAF_EXPRESSION_TYPES)
201        }
202        self._target_index = {
203            id(n): n for n in self._target.bfs() if not isinstance(n, IGNORED_LEAF_EXPRESSION_TYPES)
204        }
205        self._unmatched_source_nodes = set(self._source_index) - set(pre_matched_nodes)
206        self._unmatched_target_nodes = set(self._target_index) - set(pre_matched_nodes.values())
207        self._bigram_histo_cache: t.Dict[int, t.DefaultDict[str, int]] = {}
208
209        matching_set = self._compute_matching_set() | set(pre_matched_nodes.items())
210        return self._generate_edit_script(dict(matching_set), delta_only)