Semantic Diff for SQL
Motivation
Software is constantly changing and evolving, and identifying what has changed and reviewing those changes is an integral part of the development process. SQL code is no exception to this.
Text-based diff tools such as git diff, when applied to a code base, have certain limitations. First, they can only detect insertions and deletions, not movements or updates of individual pieces of code. Second, such tools can only detect changes between lines of text, which is too coarse for something as granular and detailed as source code. Additionally, the outcome of such a diff depends on the underlying code formatting and yields different results if the formatting changes.
Consider the following diff generated by Git:
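An illustrative reconstruction of such a diff (a hypothetical query in which the arguments b and c are swapped while the code is also reformatted):
-SELECT a, SUM(b + c) AS total FROM t GROUP BY a
+SELECT
+  a,
+  SUM(c + b) AS total
+FROM t
+GROUP BY a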
Semantically, the query hasn't changed: the two arguments b and c have been swapped (moved), which has no impact on the query's output. Yet Git replaced the whole affected expression along with a number of unrelated elements.
The alternative to text-based diffing is to compare Abstract Syntax Trees (ASTs) instead. The main advantage of ASTs is that they are a direct product of code parsing and represent the underlying code structure at any desired level of granularity. Comparing ASTs can yield extremely precise diffs; changes such as code movements and updates can also be detected. Even more importantly, this approach facilitates additional use cases beyond eyeballing two versions of source code side by side.
The use cases I had in mind for SQL when I decided to embark on this journey of semantic diffing were the following:
- Query similarity score. Identifying which parts two queries have in common in order to automatically suggest opportunities for consolidation, creation of intermediate/staging tables, and so on.
- Differentiating between cosmetic / structural changes and functional ones. For example, when a nested query is refactored into a common table expression (CTE), this kind of change has no functional impact on either the query or its outcome.
- Automatic suggestions about the need to retroactively backfill data. This is especially important for pipelines that populate very large tables for which restatement is a runtime-intensive procedure. The ability to discern between simple code movements and actual modifications can help assess the impact of a change and make suggestions accordingly.
The implementation discussed in this post is now a part of the SQLGlot library. You can find the complete source code in the diff.py module. The choice of SQLGlot was an obvious one due to its simple but powerful API, lack of external dependencies and, more importantly, its extensive list of supported SQL dialects.
The Search for a Solution
When it comes to any diffing tool (not just a semantic one), the primary challenge is to match as many elements of compared entities as possible. Once such a set of matching elements is available, deriving a sequence of changes becomes an easy task.
If our elements have unique identifiers associated with them (for example, an element's ID in the DOM), the matching problem is trivial. However, the SQL syntax trees that we are comparing have neither unique keys nor object identifiers that can be used for the purposes of matching. So how are we supposed to find pairs of nodes that are related?
To better illustrate the problem, consider comparing the following SQL expressions: SELECT a + b + c, d, e and SELECT a - b + c, e, f. Matching individual nodes from the respective syntax trees can be visualized as follows:
Figure 1: Example of node matching for two SQL expression trees.
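You can also inspect these tree structures directly with SQLGlot (illustrative usage):
>>> from sqlglot import parse_one
>>> parse_one("SELECT a + b + c, d, e")
The repr output shows a Select node whose projections include a nested Add subtree (Add(Add(a, b), c), since addition is left-associative) along with the columns d and e.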
Looking at Figure 1, we conclude that the following changes should be captured by our solution:
- Inserted nodes: Sub and f. These are the nodes from the target AST which do not have a matching node in the source AST.
- Removed nodes: Add and d. These are the nodes from the source AST which do not have a counterpart in the target AST.
- Remaining nodes must be identified as unchanged.
It should be clear at this point that if we manage to match nodes in the source tree with their counterparts in the target tree, then computing the diff becomes a trivial matter.
Naïve Brute-Force
The naïve solution would be to try all different permutations of node pair combinations and see which set of pairs performs best based on some kind of heuristic. The runtime cost of such a solution quickly reaches escape velocity: if both trees had only 10 nodes each, the number of such sets would be approximately 10!^2 = 3.6M^2 ~= 13 * 10^12. This is a very bad case of factorial complexity (to be precise, it's actually much worse, O(n!^2), but I couldn't come up with a name for it), so there is little need to explore this approach any further.
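A quick sanity check of that number:
>>> import math
>>> math.factorial(10) ** 2
13168189440000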
Myers Algorithm
After the naïve approach was proven to be infeasible, the next question I asked myself was "how does git diff work?". This question led me to discover the Myers diff algorithm [1]. This algorithm was designed to compare sequences of strings. At its core, it looks for the shortest path on a graph of possible edits that transform the first sequence into the second one, while heavily rewarding those paths that lead to the longest subsequences of unchanged elements. There's a lot of material out there describing this algorithm in greater detail; I found James Coglan's series of blog posts to be the most comprehensive.
Therefore, I had this "brilliant" (actually not) idea to transform trees into sequences by traversing them in topological order, and then apply the Myers algorithm to the resulting sequences while using a custom heuristic to check the equality of two nodes. Unsurprisingly, comparing sequences of strings is quite different from comparing hierarchical tree structures, and by flattening trees into sequences, we lose a lot of relevant context. As a result, the algorithm performed terribly on ASTs. It often matched completely unrelated nodes, even when the two trees were mostly the same, and produced extremely inaccurate lists of changes overall. After playing around with it a little and tweaking my equality heuristic to improve accuracy, I ultimately scrapped the whole implementation and went back to the drawing board.
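For the record, here is a minimal sketch of the kind of flattening I mean (the function is my own illustration, relying on SQLGlot's iter_expressions to visit child nodes):
def flatten(node):
    # Post-order traversal: children first, then the node itself,
    # producing a sequence the Myers algorithm can consume.
    for child in node.iter_expressions():
        yield from flatten(child)
    yield node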
Change Distiller
The algorithm I settled on in the end was Change Distiller, created by Fluri et al. [2], which in turn is an improvement over the core idea described by Chawathe et al. [3].
The algorithm consists of two high-level steps:
- Finding appropriate matchings between pairs of nodes that are part of compared ASTs. Identifying what is meant by “appropriate” matching is also a part of this step.
- Generating the so-called “edit script” from the matching set built in the first step. The edit script is a sequence of edit operations (for example, insert, remove, update) on individual tree nodes such that, when applied as transformations on the source AST, it eventually becomes the target AST. In general, the shorter the sequence, the better. The length of the edit script can be used to compare the performance of different algorithms, though this is not the only metric that matters.
The rest of this section is dedicated to the Python implementation of the steps above using the AST implementation provided by the SQLGlot library.
Building the Matching Set
Matching Leaves
We begin composing the matching set by matching the leaf nodes, i.e. the nodes that do not have any child nodes (such as literals, identifiers, etc.). In order to match them, we gather all the leaf nodes from the source tree and generate a cartesian product with all the leaves from the target tree, comparing the pairs created this way and assigning each one a similarity score. During this stage, we also exclude pairs that don't pass basic matching criteria. Then, we pick the pairs that scored the highest while making sure that each node is matched no more than once.
Using the example provided at the beginning of the post, the process of building an initial set of candidate matchings can be seen in Figure 2.
Figure 2: Building a set of candidate matchings between leaf nodes. The third item in each triplet represents a similarity score between two nodes.
First, let’s analyze the similarity score. Then, we’ll discuss matching criteria.
The similarity score proposed by Fluri et al. [2] is a dice coefficient applied to bigrams of respective node values. A bigram is a sequence of two adjacent elements from a string computed in a sliding window fashion:
def bigram(string):
count = max(0, len(string) - 1)
return [string[i : i + 2] for i in range(count)]
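A quick illustrative check of the helper above:
>>> bigram("SELECT")
['SE', 'EL', 'LE', 'EC', 'CT']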
For reasons that will become clear shortly, we actually need to compute bigram histograms rather than just sequences:
from collections import defaultdict
def bigram_histo(string):
count = max(0, len(string) - 1)
bigram_histo = defaultdict(int)
for i in range(count):
bigram_histo[string[i : i + 2]] += 1
return bigram_histo
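The histogram form matters when a bigram repeats within a string; for example:
>>> bigram_histo("aaa")
defaultdict(<class 'int'>, {'aa': 2})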
The dice coefficient formula looks like the following:
dice(X, Y) = 2 * |X ∩ Y| / (|X| + |Y|)
where X is the multiset of bigrams of the source node and Y is the multiset of bigrams of the target one. What this essentially does is count the number of bigram elements the two nodes have in common, multiply the result by 2, and then divide it by the total number of elements in both bigrams. This is where bigram histograms come in handy:
def dice_coefficient(source, target):
source_histo = bigram_histo(source.sql())
target_histo = bigram_histo(target.sql())
total_grams = (
sum(source_histo.values()) + sum(target_histo.values())
)
if not total_grams:
return 1.0 if source == target else 0.0
overlap_len = 0
overlapping_grams = set(source_histo) & set(target_histo)
for g in overlapping_grams:
overlap_len += min(source_histo[g], target_histo[g])
return 2 * overlap_len / total_grams
To compute bigrams for a tree node, we first transform the node into its canonical SQL representation, so that the Literal(123) node becomes just “123” and the Identifier(“a”) node becomes just “a”. We also handle the scenario in which strings are too short to derive bigrams; in this case, we fall back to checking the two nodes for equality.
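For instance, here is what the score looks like for two simple expressions (illustrative usage, assuming the helpers above):
>>> from sqlglot import parse_one
>>> dice_coefficient(parse_one("a + b"), parse_one("a + c"))
0.75
The two canonical strings "a + b" and "a + c" share three of their four bigrams each ("a ", " +", "+ "), so the score is 2 * 3 / (4 + 4) = 0.75.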
Now that we know how to compute the similarity score, we can take care of the matching criteria for leaf nodes. In the original paper [2], the matching criteria are formalized as follows:
The two nodes are matched if two conditions are met:
- The node labels match (in our case labels are just node types).
- The similarity score for node values is greater than or equal to some threshold “f”. The authors of the paper recommend setting the value of “f” to 0.6.
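The snippets that follow rely on two helpers whose definitions are omitted from this post: _get_leaves, which yields nodes that have no child expressions, and _is_same_type, which compares node labels. Here are minimal sketches of both (my own approximations, not necessarily the library's exact code):
def _get_leaves(expression):
    # A node is a leaf if it has no child expressions of its own.
    has_child_exprs = False
    for child in expression.iter_expressions():
        has_child_exprs = True
        yield from _get_leaves(child)
    if not has_child_exprs:
        yield expression

def _is_same_type(source, target):
    # In our case, node labels are just expression types.
    return type(source) is type(target)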
With building blocks in place, we can now build a matching set for leaf nodes. First, we generate a list of candidates for matching:
from heapq import heappush, heappop
candidate_matchings = []
source_leaves = _get_leaves(self._source)
target_leaves = _get_leaves(self._target)
for source_leaf in source_leaves:
for target_leaf in target_leaves:
if _is_same_type(source_leaf, target_leaf):
similarity_score = dice_coefficient(
source_leaf, target_leaf
)
if similarity_score >= 0.6:
heappush(
candidate_matchings,
(
-similarity_score,
len(candidate_matchings),
source_leaf,
target_leaf,
),
)
In the implementation above, we push each matching pair onto the heap to automatically maintain the correct order based on the assigned similarity score. Note that the score is negated, so the heap pops pairs with the highest similarity first, while the current length of the list serves as a tie-breaker that preserves insertion order.
Finally, we build the initial matching set by picking leaf pairs with the highest score:
matching_set = set()
while candidate_matchings:
_, _, source_leaf, target_leaf = heappop(candidate_matchings)
if (
source_leaf in unmatched_source_nodes
and target_leaf in unmatched_target_nodes
):
matching_set.add((source_leaf, target_leaf))
unmatched_source_nodes.remove(source_leaf)
unmatched_target_nodes.remove(target_leaf)
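Note that unmatched_source_nodes and unmatched_target_nodes start out containing every node of the respective trees, along the lines of the following sketch (an assumption based on how the sets are used; the library actually tracks nodes by their id()):
# All nodes begin unmatched; nodes are removed as pairs are formed.
unmatched_source_nodes = set(self._source.walk())
unmatched_target_nodes = set(self._target.walk())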
To finalize the matching set, we should now proceed with matching inner nodes.
Matching Inner Nodes
Matching inner nodes is quite similar to matching leaf nodes, with the following two distinctions:
- Rather than ranking a set of possible candidates, we pick the first node pair that passes the matching criteria.
- The matching criteria themselves have been extended to account for the number of leaf nodes the pair of inner nodes have in common.
Figure 3: Matching inner nodes based on their type as well as how many of their leaf nodes have been previously matched.
Let's start with the matching criteria. Alongside the already familiar similarity score and node type criteria, there is a new one: the ratio of leaf nodes that the two inner nodes have in common must exceed some threshold “t”. The recommended value for “t” is also 0.6. Counting the number of common leaf nodes is pretty straightforward, since we already have the complete matching set for leaves. All we need to do is count how many matching pairs the leaf nodes of the two compared inner nodes form.
There are two additional heuristics associated with these matching criteria:
- Inner node similarity weighting: if the similarity score between the node values doesn’t pass the threshold “f” but the ratio of common leaf nodes (“t”) is greater than or equal to 0.8, then the matching is considered successful.
- The threshold “t” is reduced to 0.4 for inner nodes with 4 or fewer leaf nodes, in order to decrease the false negative rate for small subtrees.
We now only have to iterate through the remaining unmatched nodes and form matching pairs based on the outlined criteria:
leaves_matching_set = matching_set.copy()
for source_node in unmatched_source_nodes.copy():
for target_node in unmatched_target_nodes:
if _is_same_type(source_node, target_node):
source_leaves = set(_get_leaves(source_node))
target_leaves = set(_get_leaves(target_node))
max_leaves_num = max(len(source_leaves), len(target_leaves))
if max_leaves_num:
common_leaves_num = sum(
1 if s in source_leaves and t in target_leaves else 0
for s, t in leaves_matching_set
)
leaf_similarity_score = common_leaves_num / max_leaves_num
else:
leaf_similarity_score = 0.0
adjusted_t = (
0.6
if min(len(source_leaves), len(target_leaves)) > 4
else 0.4
)
if leaf_similarity_score >= 0.8 or (
leaf_similarity_score >= adjusted_t
and dice_coefficient(source_node, target_node) >= 0.6
):
matching_set.add((source_node, target_node))
unmatched_source_nodes.remove(source_node)
unmatched_target_nodes.remove(target_node)
break
After the matching set is formed, we can proceed with generation of the edit script, which will be the algorithm’s output.
Generating the Edit Script
At this point, we should have the following 3 sets at our disposal:
- The set of matched node pairs.
- The set of remaining unmatched nodes from the source tree.
- The set of remaining unmatched nodes from the target tree.
We can derive 3 kinds of edits from the matching set: either the node's value was updated (Update), the node was moved to a different position within the tree (Move), or the node remained unchanged (Keep). Note that the Move case is not mutually exclusive with the other two: a node could have been updated, or remained the same, while its position within its parent node (or the parent node itself) changed. All unmatched nodes from the source tree are the ones that were removed (Remove), while unmatched nodes from the target tree are the ones that were inserted (Insert).
The latter two cases are pretty straightforward to implement:
edit_script = []
for removed_node in unmatched_source_nodes:
edit_script.append(Remove(removed_node))
for inserted_node in unmatched_target_nodes:
edit_script.append(Insert(inserted_node))
Traversing the matching set requires a little more thought:
for source_node, target_node in matching_set:
if (
not isinstance(source_node, LEAF_EXPRESSION_TYPES)
or source_node == target_node
):
move_edits = generate_move_edits(
source_node, target_node, matching_set
)
edit_script.extend(move_edits)
edit_script.append(Keep(source_node, target_node))
else:
edit_script.append(Update(source_node, target_node))
If a matching pair represents a pair of leaf nodes, we check if they are the same to decide whether an update took place. For inner node pairs, we also need to compare the positions of their respective children to detect node movements. Chawathe et al. [3] suggest applying the longest common subsequence (LCS) algorithm which, no surprise here, was described by Myers himself [1]. There is a small catch, however: instead of checking the equality of two child nodes, we need to check whether the two nodes form a pair that is part of our matching set.
Now with this knowledge, the implementation becomes straightforward:
def generate_move_edits(source, target, matching_set):
source_children = _get_child_nodes(source)
target_children = _get_child_nodes(target)
lcs = set(
_longest_common_subsequence(
source_children,
target_children,
lambda l, r: (l, r) in matching_set
)
)
move_edits = []
for node in source_children:
if node not in lcs and node not in unmatched_source_nodes:
move_edits.append(Move(node))
return move_edits
I left out the implementation of the LCS algorithm itself, but there are plenty of implementations out there that can easily be looked up.
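For completeness, here is one possible dynamic programming implementation matching the signature used above (a sketch; the library's version may differ):
def _longest_common_subsequence(seq_a, seq_b, equal):
    # Classic quadratic-time LCS, parameterized by a custom equality
    # predicate instead of the built-in == operator.
    len_a, len_b = len(seq_a), len(seq_b)
    table = [[[] for _ in range(len_b + 1)] for _ in range(len_a + 1)]
    for i in range(1, len_a + 1):
        for j in range(1, len_b + 1):
            if equal(seq_a[i - 1], seq_b[j - 1]):
                table[i][j] = table[i - 1][j - 1] + [seq_a[i - 1]]
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1], key=len)
    return table[len_a][len_b]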
Output
The implemented algorithm produces output that resembles the following:
>>> from sqlglot import parse_one, diff
>>> diff(parse_one("SELECT a + b + c, d, e"), parse_one("SELECT a - b + c, e, f"))
Remove(Add)
Remove(Column(d))
Remove(Identifier(d))
Insert(Sub)
Insert(Column(f))
Insert(Identifier(f))
Keep(Select, Select)
Keep(Add, Add)
Keep(Column(a), Column(a))
Keep(Identifier(a), Identifier(a))
Keep(Column(b), Column(b))
Keep(Identifier(b), Identifier(b))
Keep(Column(c), Column(c))
Keep(Identifier(c), Identifier(c))
Keep(Column(e), Column(e))
Keep(Identifier(e), Identifier(e))
Note that the output above is abbreviated. The string representation of actual AST nodes is significantly more verbose.
The implementation works especially well when coupled with SQLGlot's query optimizer, which can be used to produce canonical representations of the compared queries:
>>> from sqlglot import diff, parse_one
>>> from sqlglot.diff import Keep
>>> from sqlglot.optimizer import optimize
>>> schema = {"t": {"a": "INT", "b": "INT", "c": "INT", "d": "INT"}}
>>> source = """
... SELECT 1 + 1 + a
... FROM t
... WHERE b = 1 OR (c = 2 AND d = 3)
... """
>>> target = """
... SELECT 2 + a
... FROM t
... WHERE (b = 1 OR c = 2) AND (b = 1 OR d = 3)
... """
>>> optimized_source = optimize(parse_one(source), schema=schema)
>>> optimized_target = optimize(parse_one(target), schema=schema)
>>> edit_script = diff(optimized_source, optimized_target)
>>> sum(0 if isinstance(e, Keep) else 1 for e in edit_script)
0
Optimizations
The worst-case runtime complexity of this algorithm is not exactly stellar: O(n^2 * log n^2). This is because of the leaf matching process, which involves ranking a cartesian product of all leaf nodes of the compared trees. Unsurprisingly, the algorithm takes considerable time to finish on bigger queries.
There are still a few basic things we can do in our implementation to help improve performance:
- Refer to individual node objects using their identifiers (Python's id()) instead of direct references in sets (see the sketch below). This helps avoid costly recursive hash calculations and equality checks.
- Cache bigram histograms to avoid computing them more than once for the same node.
- Compute the canonical SQL string representation for each tree once while caching string representations of all inner nodes. This prevents redundant tree traversals when bigrams are computed.
At the time of writing only the first two optimizations have been implemented, so there is an opportunity to contribute for anyone who’s interested.
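To give a concrete sense of the first optimization, here is a sketch of the id-based bookkeeping (variable names are illustrative; the library's internals may differ):
# Index every node by its object identity once...
source_index = {id(n): n for n in source.walk()}
# ...then track matching state with cheap integer ids rather than
# recursively hashed expression objects.
unmatched_source_nodes = set(source_index)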
Alternative Solutions
This section is dedicated to solutions that I've investigated but haven't tried.
First, this section wouldn't be complete without Tristan Hume's blog post. Tristan's solution has a lot in common with the Myers algorithm, plus heuristics that are much more clever than what I came up with. The implementation relies on a combination of dynamic programming and the A* search algorithm to explore the space of possible matchings and pick the best ones. It seemed to have worked well for Tristan's specific use case, but after my negative experience with the Myers algorithm, I decided to try something different.
Another notable approach is the GumTree algorithm by Falleri et al. [4]. I discovered this paper after I'd already implemented the algorithm that is the main focus of this post. In sections 5.2 and 5.3 of their paper, the authors compare the two algorithms side by side and claim that GumTree is significantly better in terms of both runtime performance and accuracy when evaluated on 12,792 pairs of Java source files. This doesn't surprise me, as the algorithm takes the height of subtrees into account. In my tests, I definitely saw scenarios in which this context would have helped. On top of that, the authors promise O(n^2) worst-case runtime complexity which, given Change Distiller's O(n^2 * log n^2), looks particularly tempting. I hope to try this algorithm out at some point, and there is a good chance you'll see me writing about it in future posts.
Conclusion
The Change Distiller algorithm yielded quite satisfactory results in most of my tests. The scenarios in which it fell short mostly concerned identical (or very similar) subtrees located in different parts of the AST. In those cases, node mismatches were frequent and, as a result, edit scripts were somewhat suboptimal.
Additionally, the runtime performance of the algorithm leaves a lot to be desired. On trees with 1000 leaf nodes each, the algorithm takes a little under 2 seconds to complete. My implementation still has room for improvement, but this should give you a rough idea of what to expect. It appears that the Gumtree algorithm [4] can help address both of these points. I hope to find bandwidth to work on it soon and then compare the two algorithms side-by-side to find out which one performs better on SQL specifically. In the meantime, Change Distiller definitely gets the job done, and I can now proceed with applying it to some of the use cases I mentioned at the beginning of this post.
I'm also curious to learn whether other folks in the industry have faced a similar problem, and how they approached it. If you did something similar, I'm interested to hear about your experience.
References
[1] Eugene W. Myers. An O(ND) Difference Algorithm and Its Variations. Algorithmica, 1(2): 251-266, 1986.
[2] B. Fluri, M. Würsch, M. Pinzger, and H. Gall. Change Distilling: Tree Differencing for Fine-Grained Source Code Change Extraction. IEEE Transactions on Software Engineering, 33(11): 725-743, 2007.
[3] S. S. Chawathe, A. Rajaraman, H. Garcia-Molina, and J. Widom. Change Detection in Hierarchically Structured Information. Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 493-504, June 1996.
[4] J.-R. Falleri, F. Morandat, X. Blanc, M. Martinez, and M. Monperrus. Fine-grained and Accurate Source Code Differencing. Proceedings of the International Conference on Automated Software Engineering, pp. 313-324, 2014. doi:10.1145/2642937.2642982.
1""" 2.. include:: ../posts/sql_diff.md 3 4---- 5""" 6 7from __future__ import annotations 8 9import typing as t 10from collections import defaultdict 11from dataclasses import dataclass 12from heapq import heappop, heappush 13from itertools import chain 14 15from sqlglot import Dialect, expressions as exp 16from sqlglot.helper import seq_get 17 18if t.TYPE_CHECKING: 19 from sqlglot.dialects.dialect import DialectType 20 21 22@dataclass(frozen=True) 23class Insert: 24 """Indicates that a new node has been inserted""" 25 26 expression: exp.Expression 27 28 29@dataclass(frozen=True) 30class Remove: 31 """Indicates that an existing node has been removed""" 32 33 expression: exp.Expression 34 35 36@dataclass(frozen=True) 37class Move: 38 """Indicates that an existing node's position within the tree has changed""" 39 40 source: exp.Expression 41 target: exp.Expression 42 43 44@dataclass(frozen=True) 45class Update: 46 """Indicates that an existing node has been updated""" 47 48 source: exp.Expression 49 target: exp.Expression 50 51 52@dataclass(frozen=True) 53class Keep: 54 """Indicates that an existing node hasn't been changed""" 55 56 source: exp.Expression 57 target: exp.Expression 58 59 60if t.TYPE_CHECKING: 61 from sqlglot._typing import T 62 63 Edit = t.Union[Insert, Remove, Move, Update, Keep] 64 65 66def diff( 67 source: exp.Expression, 68 target: exp.Expression, 69 matchings: t.List[t.Tuple[exp.Expression, exp.Expression]] | None = None, 70 delta_only: bool = False, 71 copy: bool = True, 72 **kwargs: t.Any, 73) -> t.List[Edit]: 74 """ 75 Returns the list of changes between the source and the target expressions. 76 77 Examples: 78 >>> diff(parse_one("a + b"), parse_one("a + c")) 79 [ 80 Remove(expression=(COLUMN this: (IDENTIFIER this: b, quoted: False))), 81 Insert(expression=(COLUMN this: (IDENTIFIER this: c, quoted: False))), 82 Keep( 83 source=(ADD this: ...), 84 target=(ADD this: ...) 85 ), 86 Keep( 87 source=(COLUMN this: (IDENTIFIER this: a, quoted: False)), 88 target=(COLUMN this: (IDENTIFIER this: a, quoted: False)) 89 ), 90 ] 91 92 Args: 93 source: the source expression. 94 target: the target expression against which the diff should be calculated. 95 matchings: the list of pre-matched node pairs which is used to help the algorithm's 96 heuristics produce better results for subtrees that are known by a caller to be matching. 97 Note: expression references in this list must refer to the same node objects that are 98 referenced in the source / target trees. 99 delta_only: excludes all `Keep` nodes from the diff. 100 copy: whether to copy the input expressions. 101 Note: if this is set to false, the caller must ensure that there are no shared references 102 in the two trees, otherwise the diffing algorithm may produce unexpected behavior. 103 kwargs: additional arguments to pass to the ChangeDistiller instance. 104 105 Returns: 106 the list of Insert, Remove, Move, Update and Keep objects for each node in the source and the 107 target expression trees. This list represents a sequence of steps needed to transform the source 108 expression tree into the target one. 
109 """ 110 matchings = matchings or [] 111 matching_ids = {id(n) for pair in matchings for n in pair} 112 113 def compute_node_mappings( 114 original: exp.Expression, copy: exp.Expression 115 ) -> t.Dict[int, exp.Expression]: 116 node_mapping = {} 117 for old_node, new_node in zip( 118 reversed(tuple(original.walk())), reversed(tuple(copy.walk())) 119 ): 120 # We cache the hash of each new node here to speed up equality comparisons. If the input 121 # trees aren't copied, these hashes will be evicted before returning the edit script. 122 new_node._hash = hash(new_node) 123 124 old_node_id = id(old_node) 125 if old_node_id in matching_ids: 126 node_mapping[old_node_id] = new_node 127 128 return node_mapping 129 130 source_copy = source.copy() if copy else source 131 target_copy = target.copy() if copy else target 132 133 node_mappings = { 134 **compute_node_mappings(source, source_copy), 135 **compute_node_mappings(target, target_copy), 136 } 137 matchings_copy = [(node_mappings[id(s)], node_mappings[id(t)]) for s, t in matchings] 138 139 edit_script = ChangeDistiller(**kwargs).diff( 140 source_copy, 141 target_copy, 142 matchings=matchings_copy, 143 delta_only=delta_only, 144 ) 145 146 if not copy: 147 for node in chain(source.walk(), target.walk()): 148 node._hash = None 149 150 return edit_script 151 152 153# The expression types for which Update edits are allowed. 154UPDATABLE_EXPRESSION_TYPES = ( 155 exp.Alias, 156 exp.Boolean, 157 exp.Column, 158 exp.DataType, 159 exp.Lambda, 160 exp.Literal, 161 exp.Table, 162 exp.Window, 163) 164 165IGNORED_LEAF_EXPRESSION_TYPES = (exp.Identifier,) 166 167 168class ChangeDistiller: 169 """ 170 The implementation of the Change Distiller algorithm described by Beat Fluri and Martin Pinzger in 171 their paper https://ieeexplore.ieee.org/document/4339230, which in turn is based on the algorithm by 172 Chawathe et al. described in http://ilpubs.stanford.edu:8090/115/1/1995-46.pdf. 
173 """ 174 175 def __init__(self, f: float = 0.6, t: float = 0.6, dialect: DialectType = None) -> None: 176 self.f = f 177 self.t = t 178 self._sql_generator = Dialect.get_or_raise(dialect).generator() 179 180 def diff( 181 self, 182 source: exp.Expression, 183 target: exp.Expression, 184 matchings: t.List[t.Tuple[exp.Expression, exp.Expression]] | None = None, 185 delta_only: bool = False, 186 ) -> t.List[Edit]: 187 matchings = matchings or [] 188 pre_matched_nodes = {id(s): id(t) for s, t in matchings} 189 if len({n for pair in pre_matched_nodes.items() for n in pair}) != 2 * len(matchings): 190 raise ValueError("Each node can be referenced at most once in the list of matchings") 191 192 self._source = source 193 self._target = target 194 self._source_index = { 195 id(n): n for n in self._source.bfs() if not isinstance(n, IGNORED_LEAF_EXPRESSION_TYPES) 196 } 197 self._target_index = { 198 id(n): n for n in self._target.bfs() if not isinstance(n, IGNORED_LEAF_EXPRESSION_TYPES) 199 } 200 self._unmatched_source_nodes = set(self._source_index) - set(pre_matched_nodes) 201 self._unmatched_target_nodes = set(self._target_index) - set(pre_matched_nodes.values()) 202 self._bigram_histo_cache: t.Dict[int, t.DefaultDict[str, int]] = {} 203 204 matching_set = self._compute_matching_set() | set(pre_matched_nodes.items()) 205 return self._generate_edit_script(dict(matching_set), delta_only) 206 207 def _generate_edit_script(self, matchings: t.Dict[int, int], delta_only: bool) -> t.List[Edit]: 208 edit_script: t.List[Edit] = [] 209 for removed_node_id in self._unmatched_source_nodes: 210 edit_script.append(Remove(self._source_index[removed_node_id])) 211 for inserted_node_id in self._unmatched_target_nodes: 212 edit_script.append(Insert(self._target_index[inserted_node_id])) 213 for kept_source_node_id, kept_target_node_id in matchings.items(): 214 source_node = self._source_index[kept_source_node_id] 215 target_node = self._target_index[kept_target_node_id] 216 217 identical_nodes = source_node == target_node 218 219 if not isinstance(source_node, UPDATABLE_EXPRESSION_TYPES) or identical_nodes: 220 if identical_nodes: 221 source_parent = source_node.parent 222 target_parent = target_node.parent 223 224 if ( 225 (source_parent and not target_parent) 226 or (not source_parent and target_parent) 227 or ( 228 source_parent 229 and target_parent 230 and matchings.get(id(source_parent)) != id(target_parent) 231 ) 232 ): 233 edit_script.append(Move(source=source_node, target=target_node)) 234 else: 235 edit_script.extend( 236 self._generate_move_edits(source_node, target_node, matchings) 237 ) 238 239 source_non_expression_leaves = dict(_get_non_expression_leaves(source_node)) 240 target_non_expression_leaves = dict(_get_non_expression_leaves(target_node)) 241 242 if source_non_expression_leaves != target_non_expression_leaves: 243 edit_script.append(Update(source_node, target_node)) 244 elif not delta_only: 245 edit_script.append(Keep(source_node, target_node)) 246 else: 247 edit_script.append(Update(source_node, target_node)) 248 249 return edit_script 250 251 def _generate_move_edits( 252 self, source: exp.Expression, target: exp.Expression, matchings: t.Dict[int, int] 253 ) -> t.List[Move]: 254 source_args = [id(e) for e in _expression_only_args(source)] 255 target_args = [id(e) for e in _expression_only_args(target)] 256 257 args_lcs = set( 258 _lcs(source_args, target_args, lambda l, r: matchings.get(t.cast(int, l)) == r) 259 ) 260 261 move_edits = [] 262 for a in source_args: 263 if a not in 
args_lcs and a not in self._unmatched_source_nodes: 264 move_edits.append( 265 Move(source=self._source_index[a], target=self._target_index[matchings[a]]) 266 ) 267 268 return move_edits 269 270 def _compute_matching_set(self) -> t.Set[t.Tuple[int, int]]: 271 leaves_matching_set = self._compute_leaf_matching_set() 272 matching_set = leaves_matching_set.copy() 273 274 ordered_unmatched_source_nodes = { 275 id(n): None for n in self._source.bfs() if id(n) in self._unmatched_source_nodes 276 } 277 ordered_unmatched_target_nodes = { 278 id(n): None for n in self._target.bfs() if id(n) in self._unmatched_target_nodes 279 } 280 281 for source_node_id in ordered_unmatched_source_nodes: 282 for target_node_id in ordered_unmatched_target_nodes: 283 source_node = self._source_index[source_node_id] 284 target_node = self._target_index[target_node_id] 285 if _is_same_type(source_node, target_node): 286 source_leaf_ids = {id(l) for l in _get_expression_leaves(source_node)} 287 target_leaf_ids = {id(l) for l in _get_expression_leaves(target_node)} 288 289 max_leaves_num = max(len(source_leaf_ids), len(target_leaf_ids)) 290 if max_leaves_num: 291 common_leaves_num = sum( 292 1 if s in source_leaf_ids and t in target_leaf_ids else 0 293 for s, t in leaves_matching_set 294 ) 295 leaf_similarity_score = common_leaves_num / max_leaves_num 296 else: 297 leaf_similarity_score = 0.0 298 299 adjusted_t = ( 300 self.t if min(len(source_leaf_ids), len(target_leaf_ids)) > 4 else 0.4 301 ) 302 303 if leaf_similarity_score >= 0.8 or ( 304 leaf_similarity_score >= adjusted_t 305 and self._dice_coefficient(source_node, target_node) >= self.f 306 ): 307 matching_set.add((source_node_id, target_node_id)) 308 self._unmatched_source_nodes.remove(source_node_id) 309 self._unmatched_target_nodes.remove(target_node_id) 310 ordered_unmatched_target_nodes.pop(target_node_id, None) 311 break 312 313 return matching_set 314 315 def _compute_leaf_matching_set(self) -> t.Set[t.Tuple[int, int]]: 316 candidate_matchings: t.List[t.Tuple[float, int, int, exp.Expression, exp.Expression]] = [] 317 source_expression_leaves = list(_get_expression_leaves(self._source)) 318 target_expression_leaves = list(_get_expression_leaves(self._target)) 319 for source_leaf in source_expression_leaves: 320 for target_leaf in target_expression_leaves: 321 if _is_same_type(source_leaf, target_leaf): 322 similarity_score = self._dice_coefficient(source_leaf, target_leaf) 323 if similarity_score >= self.f: 324 heappush( 325 candidate_matchings, 326 ( 327 -similarity_score, 328 -_parent_similarity_score(source_leaf, target_leaf), 329 len(candidate_matchings), 330 source_leaf, 331 target_leaf, 332 ), 333 ) 334 335 # Pick best matchings based on the highest score 336 matching_set = set() 337 while candidate_matchings: 338 _, _, _, source_leaf, target_leaf = heappop(candidate_matchings) 339 if ( 340 id(source_leaf) in self._unmatched_source_nodes 341 and id(target_leaf) in self._unmatched_target_nodes 342 ): 343 matching_set.add((id(source_leaf), id(target_leaf))) 344 self._unmatched_source_nodes.remove(id(source_leaf)) 345 self._unmatched_target_nodes.remove(id(target_leaf)) 346 347 return matching_set 348 349 def _dice_coefficient(self, source: exp.Expression, target: exp.Expression) -> float: 350 source_histo = self._bigram_histo(source) 351 target_histo = self._bigram_histo(target) 352 353 total_grams = sum(source_histo.values()) + sum(target_histo.values()) 354 if not total_grams: 355 return 1.0 if source == target else 0.0 356 357 overlap_len = 0 358 
overlapping_grams = set(source_histo) & set(target_histo) 359 for g in overlapping_grams: 360 overlap_len += min(source_histo[g], target_histo[g]) 361 362 return 2 * overlap_len / total_grams 363 364 def _bigram_histo(self, expression: exp.Expression) -> t.DefaultDict[str, int]: 365 if id(expression) in self._bigram_histo_cache: 366 return self._bigram_histo_cache[id(expression)] 367 368 expression_str = self._sql_generator.generate(expression) 369 count = max(0, len(expression_str) - 1) 370 bigram_histo: t.DefaultDict[str, int] = defaultdict(int) 371 for i in range(count): 372 bigram_histo[expression_str[i : i + 2]] += 1 373 374 self._bigram_histo_cache[id(expression)] = bigram_histo 375 return bigram_histo 376 377 378def _get_expression_leaves(expression: exp.Expression) -> t.Iterator[exp.Expression]: 379 has_child_exprs = False 380 381 for node in expression.iter_expressions(): 382 if not isinstance(node, IGNORED_LEAF_EXPRESSION_TYPES): 383 has_child_exprs = True 384 yield from _get_expression_leaves(node) 385 386 if not has_child_exprs: 387 yield expression 388 389 390def _get_non_expression_leaves(expression: exp.Expression) -> t.Iterator[t.Tuple[str, t.Any]]: 391 for arg, value in expression.args.items(): 392 if isinstance(value, exp.Expression) or ( 393 isinstance(value, list) and isinstance(seq_get(value, 0), exp.Expression) 394 ): 395 continue 396 397 yield (arg, value) 398 399 400def _is_same_type(source: exp.Expression, target: exp.Expression) -> bool: 401 if type(source) is type(target): 402 if isinstance(source, exp.Join): 403 return source.args.get("side") == target.args.get("side") 404 405 if isinstance(source, exp.Anonymous): 406 return source.this == target.this 407 408 return True 409 410 return False 411 412 413def _parent_similarity_score( 414 source: t.Optional[exp.Expression], target: t.Optional[exp.Expression] 415) -> int: 416 if source is None or target is None or type(source) is not type(target): 417 return 0 418 419 return 1 + _parent_similarity_score(source.parent, target.parent) 420 421 422def _expression_only_args(expression: exp.Expression) -> t.Iterator[exp.Expression]: 423 yield from ( 424 arg 425 for arg in expression.iter_expressions() 426 if not isinstance(arg, IGNORED_LEAF_EXPRESSION_TYPES) 427 ) 428 429 430def _lcs( 431 seq_a: t.Sequence[T], seq_b: t.Sequence[T], equal: t.Callable[[T, T], bool] 432) -> t.Sequence[t.Optional[T]]: 433 """Calculates the longest common subsequence""" 434 435 len_a = len(seq_a) 436 len_b = len(seq_b) 437 lcs_result = [[None] * (len_b + 1) for i in range(len_a + 1)] 438 439 for i in range(len_a + 1): 440 for j in range(len_b + 1): 441 if i == 0 or j == 0: 442 lcs_result[i][j] = [] # type: ignore 443 elif equal(seq_a[i - 1], seq_b[j - 1]): 444 lcs_result[i][j] = lcs_result[i - 1][j - 1] + [seq_a[i - 1]] # type: ignore 445 else: 446 lcs_result[i][j] = ( 447 lcs_result[i - 1][j] 448 if len(lcs_result[i - 1][j]) > len(lcs_result[i][j - 1]) # type: ignore 449 else lcs_result[i][j - 1] 450 ) 451 452 return lcs_result[len_a][len_b] # type: ignore
23@dataclass(frozen=True) 24class Insert: 25 """Indicates that a new node has been inserted""" 26 27 expression: exp.Expression
Indicates that a new node has been inserted
30@dataclass(frozen=True) 31class Remove: 32 """Indicates that an existing node has been removed""" 33 34 expression: exp.Expression
Indicates that an existing node has been removed
37@dataclass(frozen=True) 38class Move: 39 """Indicates that an existing node's position within the tree has changed""" 40 41 source: exp.Expression 42 target: exp.Expression
Indicates that an existing node's position within the tree has changed
45@dataclass(frozen=True) 46class Update: 47 """Indicates that an existing node has been updated""" 48 49 source: exp.Expression 50 target: exp.Expression
Indicates that an existing node has been updated
53@dataclass(frozen=True) 54class Keep: 55 """Indicates that an existing node hasn't been changed""" 56 57 source: exp.Expression 58 target: exp.Expression
Indicates that an existing node hasn't been changed
67def diff( 68 source: exp.Expression, 69 target: exp.Expression, 70 matchings: t.List[t.Tuple[exp.Expression, exp.Expression]] | None = None, 71 delta_only: bool = False, 72 copy: bool = True, 73 **kwargs: t.Any, 74) -> t.List[Edit]: 75 """ 76 Returns the list of changes between the source and the target expressions. 77 78 Examples: 79 >>> diff(parse_one("a + b"), parse_one("a + c")) 80 [ 81 Remove(expression=(COLUMN this: (IDENTIFIER this: b, quoted: False))), 82 Insert(expression=(COLUMN this: (IDENTIFIER this: c, quoted: False))), 83 Keep( 84 source=(ADD this: ...), 85 target=(ADD this: ...) 86 ), 87 Keep( 88 source=(COLUMN this: (IDENTIFIER this: a, quoted: False)), 89 target=(COLUMN this: (IDENTIFIER this: a, quoted: False)) 90 ), 91 ] 92 93 Args: 94 source: the source expression. 95 target: the target expression against which the diff should be calculated. 96 matchings: the list of pre-matched node pairs which is used to help the algorithm's 97 heuristics produce better results for subtrees that are known by a caller to be matching. 98 Note: expression references in this list must refer to the same node objects that are 99 referenced in the source / target trees. 100 delta_only: excludes all `Keep` nodes from the diff. 101 copy: whether to copy the input expressions. 102 Note: if this is set to false, the caller must ensure that there are no shared references 103 in the two trees, otherwise the diffing algorithm may produce unexpected behavior. 104 kwargs: additional arguments to pass to the ChangeDistiller instance. 105 106 Returns: 107 the list of Insert, Remove, Move, Update and Keep objects for each node in the source and the 108 target expression trees. This list represents a sequence of steps needed to transform the source 109 expression tree into the target one. 110 """ 111 matchings = matchings or [] 112 matching_ids = {id(n) for pair in matchings for n in pair} 113 114 def compute_node_mappings( 115 original: exp.Expression, copy: exp.Expression 116 ) -> t.Dict[int, exp.Expression]: 117 node_mapping = {} 118 for old_node, new_node in zip( 119 reversed(tuple(original.walk())), reversed(tuple(copy.walk())) 120 ): 121 # We cache the hash of each new node here to speed up equality comparisons. If the input 122 # trees aren't copied, these hashes will be evicted before returning the edit script. 123 new_node._hash = hash(new_node) 124 125 old_node_id = id(old_node) 126 if old_node_id in matching_ids: 127 node_mapping[old_node_id] = new_node 128 129 return node_mapping 130 131 source_copy = source.copy() if copy else source 132 target_copy = target.copy() if copy else target 133 134 node_mappings = { 135 **compute_node_mappings(source, source_copy), 136 **compute_node_mappings(target, target_copy), 137 } 138 matchings_copy = [(node_mappings[id(s)], node_mappings[id(t)]) for s, t in matchings] 139 140 edit_script = ChangeDistiller(**kwargs).diff( 141 source_copy, 142 target_copy, 143 matchings=matchings_copy, 144 delta_only=delta_only, 145 ) 146 147 if not copy: 148 for node in chain(source.walk(), target.walk()): 149 node._hash = None 150 151 return edit_script
Returns the list of changes between the source and the target expressions.
Examples:
>>> diff(parse_one("a + b"), parse_one("a + c")) [ Remove(expression=(COLUMN this: (IDENTIFIER this: b, quoted: False))), Insert(expression=(COLUMN this: (IDENTIFIER this: c, quoted: False))), Keep( source=(ADD this: ...), target=(ADD this: ...) ), Keep( source=(COLUMN this: (IDENTIFIER this: a, quoted: False)), target=(COLUMN this: (IDENTIFIER this: a, quoted: False)) ), ]
Arguments:
- source: the source expression.
- target: the target expression against which the diff should be calculated.
- matchings: the list of pre-matched node pairs which is used to help the algorithm's heuristics produce better results for subtrees that are known by a caller to be matching. Note: expression references in this list must refer to the same node objects that are referenced in the source / target trees.
- delta_only: excludes all
Keep
nodes from the diff. - copy: whether to copy the input expressions. Note: if this is set to false, the caller must ensure that there are no shared references in the two trees, otherwise the diffing algorithm may produce unexpected behavior.
- kwargs: additional arguments to pass to the ChangeDistiller instance.
Returns:
the list of Insert, Remove, Move, Update and Keep objects for each node in the source and the target expression trees. This list represents a sequence of steps needed to transform the source expression tree into the target one.
169class ChangeDistiller: 170 """ 171 The implementation of the Change Distiller algorithm described by Beat Fluri and Martin Pinzger in 172 their paper https://ieeexplore.ieee.org/document/4339230, which in turn is based on the algorithm by 173 Chawathe et al. described in http://ilpubs.stanford.edu:8090/115/1/1995-46.pdf. 174 """ 175 176 def __init__(self, f: float = 0.6, t: float = 0.6, dialect: DialectType = None) -> None: 177 self.f = f 178 self.t = t 179 self._sql_generator = Dialect.get_or_raise(dialect).generator() 180 181 def diff( 182 self, 183 source: exp.Expression, 184 target: exp.Expression, 185 matchings: t.List[t.Tuple[exp.Expression, exp.Expression]] | None = None, 186 delta_only: bool = False, 187 ) -> t.List[Edit]: 188 matchings = matchings or [] 189 pre_matched_nodes = {id(s): id(t) for s, t in matchings} 190 if len({n for pair in pre_matched_nodes.items() for n in pair}) != 2 * len(matchings): 191 raise ValueError("Each node can be referenced at most once in the list of matchings") 192 193 self._source = source 194 self._target = target 195 self._source_index = { 196 id(n): n for n in self._source.bfs() if not isinstance(n, IGNORED_LEAF_EXPRESSION_TYPES) 197 } 198 self._target_index = { 199 id(n): n for n in self._target.bfs() if not isinstance(n, IGNORED_LEAF_EXPRESSION_TYPES) 200 } 201 self._unmatched_source_nodes = set(self._source_index) - set(pre_matched_nodes) 202 self._unmatched_target_nodes = set(self._target_index) - set(pre_matched_nodes.values()) 203 self._bigram_histo_cache: t.Dict[int, t.DefaultDict[str, int]] = {} 204 205 matching_set = self._compute_matching_set() | set(pre_matched_nodes.items()) 206 return self._generate_edit_script(dict(matching_set), delta_only) 207 208 def _generate_edit_script(self, matchings: t.Dict[int, int], delta_only: bool) -> t.List[Edit]: 209 edit_script: t.List[Edit] = [] 210 for removed_node_id in self._unmatched_source_nodes: 211 edit_script.append(Remove(self._source_index[removed_node_id])) 212 for inserted_node_id in self._unmatched_target_nodes: 213 edit_script.append(Insert(self._target_index[inserted_node_id])) 214 for kept_source_node_id, kept_target_node_id in matchings.items(): 215 source_node = self._source_index[kept_source_node_id] 216 target_node = self._target_index[kept_target_node_id] 217 218 identical_nodes = source_node == target_node 219 220 if not isinstance(source_node, UPDATABLE_EXPRESSION_TYPES) or identical_nodes: 221 if identical_nodes: 222 source_parent = source_node.parent 223 target_parent = target_node.parent 224 225 if ( 226 (source_parent and not target_parent) 227 or (not source_parent and target_parent) 228 or ( 229 source_parent 230 and target_parent 231 and matchings.get(id(source_parent)) != id(target_parent) 232 ) 233 ): 234 edit_script.append(Move(source=source_node, target=target_node)) 235 else: 236 edit_script.extend( 237 self._generate_move_edits(source_node, target_node, matchings) 238 ) 239 240 source_non_expression_leaves = dict(_get_non_expression_leaves(source_node)) 241 target_non_expression_leaves = dict(_get_non_expression_leaves(target_node)) 242 243 if source_non_expression_leaves != target_non_expression_leaves: 244 edit_script.append(Update(source_node, target_node)) 245 elif not delta_only: 246 edit_script.append(Keep(source_node, target_node)) 247 else: 248 edit_script.append(Update(source_node, target_node)) 249 250 return edit_script 251 252 def _generate_move_edits( 253 self, source: exp.Expression, target: exp.Expression, matchings: t.Dict[int, int] 254 ) -> 
t.List[Move]: 255 source_args = [id(e) for e in _expression_only_args(source)] 256 target_args = [id(e) for e in _expression_only_args(target)] 257 258 args_lcs = set( 259 _lcs(source_args, target_args, lambda l, r: matchings.get(t.cast(int, l)) == r) 260 ) 261 262 move_edits = [] 263 for a in source_args: 264 if a not in args_lcs and a not in self._unmatched_source_nodes: 265 move_edits.append( 266 Move(source=self._source_index[a], target=self._target_index[matchings[a]]) 267 ) 268 269 return move_edits 270 271 def _compute_matching_set(self) -> t.Set[t.Tuple[int, int]]: 272 leaves_matching_set = self._compute_leaf_matching_set() 273 matching_set = leaves_matching_set.copy() 274 275 ordered_unmatched_source_nodes = { 276 id(n): None for n in self._source.bfs() if id(n) in self._unmatched_source_nodes 277 } 278 ordered_unmatched_target_nodes = { 279 id(n): None for n in self._target.bfs() if id(n) in self._unmatched_target_nodes 280 } 281 282 for source_node_id in ordered_unmatched_source_nodes: 283 for target_node_id in ordered_unmatched_target_nodes: 284 source_node = self._source_index[source_node_id] 285 target_node = self._target_index[target_node_id] 286 if _is_same_type(source_node, target_node): 287 source_leaf_ids = {id(l) for l in _get_expression_leaves(source_node)} 288 target_leaf_ids = {id(l) for l in _get_expression_leaves(target_node)} 289 290 max_leaves_num = max(len(source_leaf_ids), len(target_leaf_ids)) 291 if max_leaves_num: 292 common_leaves_num = sum( 293 1 if s in source_leaf_ids and t in target_leaf_ids else 0 294 for s, t in leaves_matching_set 295 ) 296 leaf_similarity_score = common_leaves_num / max_leaves_num 297 else: 298 leaf_similarity_score = 0.0 299 300 adjusted_t = ( 301 self.t if min(len(source_leaf_ids), len(target_leaf_ids)) > 4 else 0.4 302 ) 303 304 if leaf_similarity_score >= 0.8 or ( 305 leaf_similarity_score >= adjusted_t 306 and self._dice_coefficient(source_node, target_node) >= self.f 307 ): 308 matching_set.add((source_node_id, target_node_id)) 309 self._unmatched_source_nodes.remove(source_node_id) 310 self._unmatched_target_nodes.remove(target_node_id) 311 ordered_unmatched_target_nodes.pop(target_node_id, None) 312 break 313 314 return matching_set 315 316 def _compute_leaf_matching_set(self) -> t.Set[t.Tuple[int, int]]: 317 candidate_matchings: t.List[t.Tuple[float, int, int, exp.Expression, exp.Expression]] = [] 318 source_expression_leaves = list(_get_expression_leaves(self._source)) 319 target_expression_leaves = list(_get_expression_leaves(self._target)) 320 for source_leaf in source_expression_leaves: 321 for target_leaf in target_expression_leaves: 322 if _is_same_type(source_leaf, target_leaf): 323 similarity_score = self._dice_coefficient(source_leaf, target_leaf) 324 if similarity_score >= self.f: 325 heappush( 326 candidate_matchings, 327 ( 328 -similarity_score, 329 -_parent_similarity_score(source_leaf, target_leaf), 330 len(candidate_matchings), 331 source_leaf, 332 target_leaf, 333 ), 334 ) 335 336 # Pick best matchings based on the highest score 337 matching_set = set() 338 while candidate_matchings: 339 _, _, _, source_leaf, target_leaf = heappop(candidate_matchings) 340 if ( 341 id(source_leaf) in self._unmatched_source_nodes 342 and id(target_leaf) in self._unmatched_target_nodes 343 ): 344 matching_set.add((id(source_leaf), id(target_leaf))) 345 self._unmatched_source_nodes.remove(id(source_leaf)) 346 self._unmatched_target_nodes.remove(id(target_leaf)) 347 348 return matching_set 349 350 def _dice_coefficient(self, 
source: exp.Expression, target: exp.Expression) -> float: 351 source_histo = self._bigram_histo(source) 352 target_histo = self._bigram_histo(target) 353 354 total_grams = sum(source_histo.values()) + sum(target_histo.values()) 355 if not total_grams: 356 return 1.0 if source == target else 0.0 357 358 overlap_len = 0 359 overlapping_grams = set(source_histo) & set(target_histo) 360 for g in overlapping_grams: 361 overlap_len += min(source_histo[g], target_histo[g]) 362 363 return 2 * overlap_len / total_grams 364 365 def _bigram_histo(self, expression: exp.Expression) -> t.DefaultDict[str, int]: 366 if id(expression) in self._bigram_histo_cache: 367 return self._bigram_histo_cache[id(expression)] 368 369 expression_str = self._sql_generator.generate(expression) 370 count = max(0, len(expression_str) - 1) 371 bigram_histo: t.DefaultDict[str, int] = defaultdict(int) 372 for i in range(count): 373 bigram_histo[expression_str[i : i + 2]] += 1 374 375 self._bigram_histo_cache[id(expression)] = bigram_histo 376 return bigram_histo
The implementation of the Change Distiller algorithm described by Beat Fluri and Martin Pinzger in their paper https://ieeexplore.ieee.org/document/4339230, which in turn is based on the algorithm by Chawathe et al. described in http://ilpubs.stanford.edu:8090/115/1/1995-46.pdf.
181 def diff( 182 self, 183 source: exp.Expression, 184 target: exp.Expression, 185 matchings: t.List[t.Tuple[exp.Expression, exp.Expression]] | None = None, 186 delta_only: bool = False, 187 ) -> t.List[Edit]: 188 matchings = matchings or [] 189 pre_matched_nodes = {id(s): id(t) for s, t in matchings} 190 if len({n for pair in pre_matched_nodes.items() for n in pair}) != 2 * len(matchings): 191 raise ValueError("Each node can be referenced at most once in the list of matchings") 192 193 self._source = source 194 self._target = target 195 self._source_index = { 196 id(n): n for n in self._source.bfs() if not isinstance(n, IGNORED_LEAF_EXPRESSION_TYPES) 197 } 198 self._target_index = { 199 id(n): n for n in self._target.bfs() if not isinstance(n, IGNORED_LEAF_EXPRESSION_TYPES) 200 } 201 self._unmatched_source_nodes = set(self._source_index) - set(pre_matched_nodes) 202 self._unmatched_target_nodes = set(self._target_index) - set(pre_matched_nodes.values()) 203 self._bigram_histo_cache: t.Dict[int, t.DefaultDict[str, int]] = {} 204 205 matching_set = self._compute_matching_set() | set(pre_matched_nodes.items()) 206 return self._generate_edit_script(dict(matching_set), delta_only)