A Tip about Using Python for Regular Expressions in T-SQL by Erland Sommarskog

This article first appeared in the SQLServerGeeks Magazine.
Author: Erland Sommarskog
Subscribe to get your copy.

Sometimes when we work with our data in T-SQL, we want to use advanced string matching. This is an area where T-SQL offers only rudimentary support, and if you have been working with regular expressions in other environments – Unix, Perl, C#, your text editor or even SSMS – you soon find yourself longing for regexps in T-SQL as well. This is particularly pronounced when you want to perform replacement operations, since the replace function only works with fixed strings.

Ever since the SQLCLR was introduced in SQL 2005, one option has been to use the Regex class in the System.Text.RegularExpressions namespace. Unfortunately, using the CLR comes with quite a bit of red tape, not least in recent versions of SQL Server, where all CLR modules by default require blessing from the DBA. You also need to handle extra deployment steps for the CLR assemblies. If you are already using the CLR in your database, you are likely to have the procedures in place to handle all that jazz, but if not, you may find the hurdle insurmountable.

Thankfully, there exists an alternative. In SQL 2016, Microsoft introduced the ability to run scripts in an external language with the system procedure sp_execute_external_script. While this procedure is presented under the moniker Machine Learning Services, it is a general feature that can be used for many programming tasks you want to perform from inside SQL Server, but where T-SQL is lacking in capability. For instance, to use the power of regular expressions. With Machine Learning Services there is no need for extra modules, but you put your external script as a string literal in your T-SQL code, making it a lot easier to use.

As of SQL 2019, Microsoft supplies support for three external languages: R (a specialised statistics language), Python and Java. We will work with Python, which is supported from SQL 2017 onwards. (Beware that if you are using Azure SQL Database, it has no support for running external scripts at all. Or for using the CLR, for that matter.)

To be able to use Python in your SQL scripts, three conditions must be fulfilled:
1. The support for Python must be installed.
2. The server configuration option external scripts enabled must be set to 1.
3. The user must have the database permission EXECUTE ANY EXTERNAL SCRIPT.

When it comes to the first point: Machine Learning Services is optional when you install SQL Server, and you can also select which languages to install, as seen in this screenshot from the Setup Wizard:

In this particular example, I am about to add Python support to an existing instance of SQL 2019.

I would recommend that you always install support for at least one language, even if you have no interest in machine learning, because this gives you a tool that makes your SQL Server installation more powerful. As for which language to select, that is a matter of preference.

When it comes to the second step, enabling the server configuration option, I need to add a word of caution for the server-level DBA. All these languages support reading and writing files from and to the file system, and this can potentially be a security risk. Now, Microsoft has taken some mitigating actions here. External scripts do not run in the same process as SQL Server itself, but in a so-called Launchpad process with limited permissions. Nevertheless, you should be careful if your server is sensitive, not least if there are persons with elevated permissions at database level, for instance members of db_owner, who thereby implicitly have permission to run external scripts. For more details on security for Machine Learning Services, see the Docs topic Security architecture for the extensibility framework in SQL Server Machine Learning Services.

When it comes to the last point, the database-level permission, I will return to that at the end. For now, I will assume that you are master in your own house, that is, you are sysadmin on your own server, so that you can run the demos.

Now, after all these proceedings, let’s look at some examples. Here is a test table with some data:

CREATE TABLE Playdata (id   int            NOT NULL,
                       txt  nvarchar(200)  NULL,
     CONSTRAINT pk_Playdata PRIMARY KEY (id)
) 
INSERT Playdata(id, txt)
   VALUES (1, N'This is a text with a number 70491spliced in.'),
          (2, N'This text has 8908 many numbers656in it 988.'),
          (3, N'And it is on7777ly getting worse.'),
          (4, N'951357'),
          (5, N'There are no numbers at all in this text.'),
          (6, N'(503) 555-7555')

Say that we want to delete all characters that are not digits. As for why we would want to do it, the last line gives an example: it could be phone numbers. Here is a full script for the task:

DECLARE @python_script nvarchar(MAX) = N'
import re, pandas
Data["txt"] = pandas.Series(
    [re.sub(r"[^0-9]", r"", str) for str in Data["txt"]])
' 
DROP TABLE IF EXISTS #result
CREATE TABLE #result(id   int           NOT NULL PRIMARY KEY,
                     txt  nvarchar(200) NULL)
INSERT #result (id, txt)
   EXEC sp_execute_external_script @language = N'Python', 
        @input_data_1 = N'SELECT id, txt FROM Playdata', 
        @input_data_1_name = N'Data',
        @output_data_1_name = N'Data',
        @script = @python_script

UPDATE Playdata
SET    txt = r.txt
FROM   Playdata P
JOIN   #result r ON P.id = r.id
WHERE  isnull(P.txt, '') <> isnull(r.txt, '')

SELECT id, txt FROM Playdata

The Python script sits at the top of the T-SQL script, but let’s look at the rest first.

sp_execute_external_script receives an input data set, and returns a result set. Since the Python script runs in a separate process you cannot (easily) update the source data from the Python script. Instead, we capture the result set with INSERT-EXEC and perform an UPDATE of the Playdata table after the call to sp_execute_external_script.

sp_execute_external_script takes a number of parameters. The first, @language, specifies the language of the script. @input_data_1 specifies the input data set, and you always do this through a T-SQL query. The parameter @input_data_1_name specifies the name of the Python variable to hold the input data set, and @output_data_1_name specifies the name for the Python variable that will hold the output result set from the script. In our case, we want to traverse a data set and potentially modify all rows in it, so it seems natural to use the same variable for input and output.

Let’s now turn to the Python script itself. Aesthetically oriented readers might think that it would be prettier if the Python code was indented to distinguish it from the surrounding T-SQL, but Python is quite a special language in this regard: in Python, indentation is syntactically significant, so statements must start in the first column on the outermost level.

The script only has two statements. The first statement is this line:

import re, pandas

This line imports the two packages that we need. re is the package for regular expressions, whereas pandas is a package you always need to import in Python scripts you call from SQL Server. The data type of the Data variable is pandas.DataFrame.

The second statement is split up over two lines, and this is where all the action is. (Note that the normal way of splitting up a statement over several lines in Python is to put the line break somewhere inside an open parenthesis.) There seems to be a lot here, but the good news is that you don’t need to understand all of it. Starting from the left, we seem to be assigning something to Data["txt"]. Data is indeed the name we specified in the parameters @input_data_1_name and @output_data_1_name, and txt is the column from our Playdata table, our input data set.

There is no need to concern ourselves with what pandas.Series does. Suffice it to say that it performs some magic that permits us to update the txt column in the Data variable. The argument to pandas.Series is what is known in Python as a list comprehension. All you need to know is that it traverses all rows in Data, puts the value of the column txt in the local variable str, and then calls re.sub for each row, and that brings us to the heart of the matter, to wit, the actual string substitution:

re.sub(r"[^0-9]", r"", str)

The first argument is what we are looking for. In this example, it is a very simple regular expression which says “exactly one character that is not a digit 0 to 9”. You may notice that this is exactly the same meaning as with LIKE in T-SQL. The second argument is the text to replace the match with, and in this case we just want to remove the matching character, so we pass the empty string. Observe that both string literals are preceded by r to signify what Python calls a raw string. In this example it is of no importance, but it will be in the next, and you should always put this r before the string literals you pass to re.sub.
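You can try the call on its own in any Python interpreter, for instance with the phone-number row from our test data:

```python
import re

# Every character that is not a digit 0-9 is replaced by the empty
# string, so only the digits survive.
print(re.sub(r"[^0-9]", r"", "(503) 555-7555"))  # 5035557555
```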

Here is the output from the script:

You may note that for row 5, where there are no digits at all in the text, we get back NULL rather than the empty string.
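If you want to experiment with the traversal pattern outside SQL Server, here is a minimal sketch, with a small made-up data frame standing in for the Data variable that SQL Server passes to the script:

```python
import re
import pandas

# A stand-in for the input data set; the values are made up for illustration.
Data = pandas.DataFrame({"id": [1, 2], "txt": ["ab12cd", "no digits here"]})

# The same pattern as in the script: the list comprehension visits every
# row, runs re.sub on the txt value, and the resulting list is wrapped in
# a Series so that it can be assigned back to the txt column.
Data["txt"] = pandas.Series([re.sub(r"[^0-9]", r"", str) for str in Data["txt"]])

print(list(Data["txt"]))  # ['12', '']
```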


Let’s look at a second example, working with the same data. This time we want to insert spaces around numbers that are in the middle of a word. This is how the Python script looks this time:

DECLARE @python_script nvarchar(MAX) =  N'
import re, pandas
Data["txt"] = pandas.Series(
    [re.sub(r"([a-z])(\d+)([a-z])", r"\1 \2 \3", str) for str in Data["txt"]])
'
And if you look closely, the only thing that has changed is the call to re.sub:

re.sub(r"([a-z])(\d+)([a-z])", r"\1 \2 \3", str)

This time, the search pattern is more complex. We will ignore the parentheses for the time being and come back to them later. The first component is [a-z], which matches one single lowercase character in the range a to z. (Again, reminiscent of what we have in T-SQL.) Next is \d. This is a short form for [0-9], that is, it matches one single digit. Next comes the plus, +, which means “previous regular expression one or more times”. That is, the pattern \d+ matches any sequence of digits – something you cannot express in T-SQL at all. Thus, the meaning of the full search pattern to re.sub is “one lowercase character, followed by one or more digits, ending with another lowercase character”.
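To see what \d+ buys us, here is a small experiment in plain Python, using the second row of our test data:

```python
import re

text = "This text has 8908 many numbers656in it 988."

# \d+ matches each maximal run of digits - this is what LIKE cannot express.
print(re.findall(r"\d+", text))            # ['8908', '656', '988']

# The full pattern (without the groups) only matches digit runs that are
# wedged directly between two lowercase letters.
print(re.findall(r"[a-z]\d+[a-z]", text))  # ['s656i']
```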

What about all these parentheses? Parentheses in regular expressions can, as elsewhere in computing, be used to override precedence order, but they also serve to define groups which you can refer to with \1 for the first group, \2 for the second group and so on. We use these groups in the replacement string. More precisely, \1 refers back to that first matching lowercase character, \2 to the sequence of digits and \3 to the final lowercase character in the match.
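A quick illustration of the groups and backreferences, run on a fragment from the third row of our test data:

```python
import re

# Group 1 is the letter before the digits, group 2 the digits themselves,
# group 3 the letter after; the replacement reassembles them with spaces.
print(re.sub(r"([a-z])(\d+)([a-z])", r"\1 \2 \3", "on7777ly"))  # on 7777 ly
```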

This is where the r starts to matter. Without it, Python would make its own interpretation of those backslashes, and you would not get the expected result. And backslashes are very common in regular expressions. Whence the recommendation to always mark your arguments to re.sub as raw with r.
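Here is a small demonstration of what goes wrong without the r. (The pattern is a shortened variant of the one above, purely for illustration.)

```python
import re

# With the r prefix, the backslashes reach re.sub untouched, so \1 and \2
# act as backreferences.
print(re.sub(r"([a-z])(\d+)", r"\1 \2", "on7777ly"))       # on 7777ly

# Without it, Python turns "\1" and "\2" into the control characters
# chr(1) and chr(2) before re.sub ever sees them.
print(repr(re.sub(r"([a-z])(\d+)", "\1 \2", "on7777ly")))  # 'o\x01 \x02ly'
```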

This is the output from the script:

Probably, we would like to have a space after 70491 in the first row as well, but I wanted to keep the example decently simple. As they say, an exercise left to the reader.

For the last example, we are going to work with this table:

CREATE TABLE Names (id       int           NOT NULL, 
                    fullname nvarchar(100) NOT NULL,
                    CONSTRAINT pk_Names PRIMARY KEY (id)
)
INSERT INTO Names (id, fullname) 
   VALUES (1, 'Big Business Inc.'),
          (2, 'Mary Jones Sr.'),
          (3, 'English Drinks Ltd'),
          (4, 'Frank Hinc'),
          (5, 'Dr. Michael Keen'),
          (6, 'Mr. John King Sr.') 

What we want to do is to remove titles and suffixes, including the periods, to get only the core names. To this end, we have a table with the strings to remove:

CREATE TABLE Titles (title nvarchar(50) NOT NULL 
           CONSTRAINT pk_Titles PRIMARY KEY(title)
)
INSERT INTO Titles (title)
VALUES ('inc'), ('ltd'), ('Mr'), ('Dr'), ('Sr')

What is different this time is that the input pattern will come from our Titles table. Also, this time we will not bother about updating the source table, but will return a result set with the simplified names. Here is the full script:

DECLARE @pattern nvarchar(MAX)
SELECT @pattern = string_agg('(\s*\b' + title + '\b\.*\s*)', '|') 
FROM   Titles
SELECT @pattern
DECLARE @python_script nvarchar(MAX) = N'
import re, pandas
Data["fullname"] = pandas.Series(
    [re.sub(Pattern, r"", str, flags=re.IGNORECASE) 
        for str in Data["fullname"]])
'
EXEC sp_execute_external_script @language = N'Python', 
     @input_data_1 = N'SELECT id, fullname FROM Names', 
     @input_data_1_name = N'Data',
     @output_data_1_name = N'Data',
     @script = @python_script,
     @params = N'@Pattern nvarchar(MAX)',
     @Pattern = @pattern
WITH RESULT SETS ((id        int           NOT NULL,
                   fullname  nvarchar(100) NOT NULL))

Let’s start from the bottom and work our way upwards. Because this time we are returning data, we add the clause WITH RESULT SETS to the call to sp_execute_external_script, in order to name our columns. Without this clause, the return columns would be nameless. The clause also serves as an assertion that the result set has the shape we intend it to have.

The call to sp_execute_external_script this time has two more parameters. The @params parameter specifies parameters to pass to the script with name and type. If you have used sp_executesql, you may recognise this pattern. @Pattern is exactly that parameter we defined inside @params. Observe that even though we are using named parameters throughout, @params must come last of all the fixed parameters to sp_execute_external_script, and the parameters defined inside @params must follow @params.

The Python script follows the same pattern as before, but rather than passing a string literal for the search pattern, we pass the variable Pattern, which we have populated in the T-SQL script. As in the first example, the replacement string is the empty string. We also pass a flag to re.sub to state that we want case-insensitive matching.

Finally, on the top of the script we build the search pattern with help of the string_agg function, and there is also a SELECT to show the resulting pattern, here split up over multiple lines to fit the page width:

(\s*\bDr\b\.*\s*)|(\s*\binc\b\.*\s*)|(\s*\bltd\b\.*\s*)|
(\s*\bMr\b\.*\s*)|(\s*\bSr\b\.*\s*)

I will only point out the essentials, and leave the full interpretation of the regexp to the reader.

\s stands for any white space. * means “previous regular expression zero or more times”. \b stands for word break, that is, a transition from a non-alphanumeric character to an alphanumeric one or vice versa. The \b is needed so that we don’t get a match in Drinks or Hinc. \. stands for the dot itself. The backslash is needed as an escape, since a dot on its own in a regular expression means “any character”. (Generally, in regular expressions, you should always escape punctuation characters, because even if they have no meaning today, they may acquire one in the future.) Finally, the bar | stands for alternation. That is, the pattern A|Z matches A or Z.
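If you want to verify the pattern outside SQL Server, you can build the same alternation by hand in Python; the hard-coded list below simply mirrors our Titles table:

```python
import re

# The same alternation that string_agg builds from the Titles table.
titles = ["inc", "ltd", "Mr", "Dr", "Sr"]
pattern = "|".join(r"(\s*\b" + t + r"\b\.*\s*)" for t in titles)

for name in ["Dr. Michael Keen", "English Drinks Ltd", "Frank Hinc"]:
    # Thanks to \b, Drinks and Hinc are left alone.
    print(re.sub(pattern, "", name, flags=re.IGNORECASE))
```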

This is the output from the script:

Before I draw this to a close, I would like to make two more remarks to compare this approach with using the CLR. The first is about performance. If you ran the examples, you may have noticed that there sometimes was a noticeable delay. This is due to the time it takes to start up the Launchpad process. For this reason, the approach with an external script is less preferable than a CLR solution if you run replacement operations on a smaller number of rows and want immediate response. But if you are running a data-cleansing task over megabytes of data, this start-up time is not likely to be a concern to you.

The second remark is about the permission EXECUTE ANY EXTERNAL SCRIPT. If you are a DBA who is running a data clean-up task yourself, or you work in a staging database with full permissions, this permission is of no concern. But if you put this into application code, you would have to arrange for users to get this permission, which your DBA may or may not agree to. Here is an advantage for the CLR: while there is work to get the assembly loaded and trusted on the server, users need no particular permission to run a CLR function or a CLR stored procedure. However, this permission issue is not a total roadblock for external scripts. To wit, it is possible to package a permission inside a stored procedure, and this is something I hope to discuss in my tip in the next issue of the SQLServerGeeks Magazine.

Finally, some links for further reading:
– From the SQL Docs: Quickstart: Run simple Python scripts with SQL machine learning followed by the section Quickstart: Data structures and objects using Python with SQL machine learning.
– From the Python docs: re — Regular expression operations, the reference page for the re module.
And for the truly curious who want to understand everything, from the pandas documentation: Intro to data structures.

