Regular expression to extract SQL query

筅森魡賤 submitted on 2019-11-29 08:21:45

I'll start off by saying that this is not a good way of doing it, and I strongly urge you to find another method, preferably tagging the statements properly where they are made, so you don't end up in this situation.

That being said, the SQL statements of interest here start with one of the following: DELETE, SELECT, WITH, UPDATE or INSERT INTO. The approach also requires that each statement ends with ;.

We can use this to grab all sequences matching SQL with the following:

final String regex = "^(INSERT INTO|UPDATE|SELECT|WITH|DELETE)(?:[^;']|(?:'[^']+'))+;\\s*$";
final Pattern p = Pattern.compile(regex, Pattern.MULTILINE | Pattern.DOTALL);

Group 1 now holds the leading keyword, in case you wish to filter the matches on, say, UPDATE or SELECT.

See the regex in action, as well as a caveat, here:

https://regex101.com/r/dt9XTK/2
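
As a minimal sketch of how the compiled pattern might be applied, the following loops over every match with Matcher.find(). The class name SqlStatementFinder, the method findStatements and the sample input are assumptions made for illustration; only the regex and the flags come from the answer above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the name and method are illustrative only.
class SqlStatementFinder {
    // Pattern from the answer: a leading keyword at the start of a line, then any
    // run of characters that are neither ';' nor a quote, or whole single-quoted
    // literals, then a terminating ';' that ends its line.
    private static final Pattern SQL = Pattern.compile(
        "^(INSERT INTO|UPDATE|SELECT|WITH|DELETE)(?:[^;']|(?:'[^']+'))+;\\s*$",
        Pattern.MULTILINE | Pattern.DOTALL);

    // Returns every matched statement, trimmed of surrounding whitespace.
    static List<String> findStatements(String content) {
        List<String> found = new ArrayList<>();
        Matcher m = SQL.matcher(content);
        while (m.find()) {
            found.add(m.group().trim());
        }
        return found;
    }

    public static void main(String[] args) {
        // Made-up input: two statements mixed with unrelated lines.
        String content = "some log line\n"
            + "SELECT name FROM users\nWHERE note = 'a;b';\n"
            + "more text\n"
            + "UPDATE users SET active = 1;\n";
        for (String statement : findStatements(content)) {
            System.out.println(statement);
        }
    }
}
```

On this input the semicolon inside 'a;b' does not end the first statement, because quoted literals are consumed as a unit. One limitation worth knowing: since the literal part is written as '[^']+', a statement containing an empty literal ('') is not matched at all.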

You can match it "properly" as long as the semicolon is the last non-whitespace character on that line.

final String regex = "^(SELECT|UPDATE|INSERT)[\\s\\S]+?;\\s*?$";

final Pattern p = Pattern.compile(regex, Pattern.MULTILINE);
final Matcher matcher = p.matcher(content);
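
A short sketch of how the matcher might then be consumed, assuming content holds the whole text to scan. The class name LazySqlMatchDemo, the method extract and the sample text are hypothetical; the pattern is the one from this answer written as a valid Java string literal.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original answer.
class LazySqlMatchDemo {
    // [\s\S]+? lazily consumes anything, including line breaks, and stops at the
    // first ';' that is the last non-whitespace character on its line.
    static final String regex = "^(SELECT|UPDATE|INSERT)[\\s\\S]+?;\\s*?$";
    static final Pattern p = Pattern.compile(regex, Pattern.MULTILINE);

    static List<String> extract(String content) {
        List<String> statements = new ArrayList<>();
        Matcher matcher = p.matcher(content);
        while (matcher.find()) {
            // group(1) would hold the leading keyword; group() is the whole match.
            statements.add(matcher.group().trim());
        }
        return statements;
    }

    public static void main(String[] args) {
        // Made-up input with a multi-line statement and trailing spaces after a ';'.
        String content = "SELECT a\nFROM t;\n"
            + "INSERT INTO t VALUES (1);  \n"
            + "not sql\n";
        for (String statement : extract(content)) {
            System.out.println("[" + statement + "]");
        }
    }
}
```

A semicolon in the middle of a line, for example inside 'x;y', does not end the match. A semicolon that is the last character of a line does end it, even if it sits inside a string literal, which is the restriction stated above.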

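(?m)^(UPDATE|SELECT|INSERT INTO).*;$ should work for statements that fit on a single line. The (?m) flag makes ^ and $ match at the start and end of every line, so you can loop through the matches and find all your SQL. Note that . does not match line breaks by default; to match statements that span several lines, also add the (?s) flag and make the quantifier lazy (.*?).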

Looking at the example you provided it will match your commands until the ;. You can see the example used for testing here.

If you're dealing with a language, create a lexer that tokenizes your string. Use JFlex, which is a lexical analyzer generator. It generates a Java class that splits a string into tokens based on a grammar specified in a special file. Take the relevant grammar rules from this file.

Parsing is a separate process from tokenization (or lexical analysis). You might want to use a parser generator after lexical analysis, if lexical analysis alone is not enough.
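
To make the idea of scanning with state concrete, here is a minimal hand-written sketch. It is not JFlex output and does not use a JFlex specification; it only illustrates the kind of context (inside or outside a quoted literal) that a generated lexer would track for you. The class name StatementSplitter and its method are hypothetical, and comments, double-quoted identifiers and dialect-specific quoting are deliberately not handled.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical hand-written scanner; a generated lexer would replace this.
class StatementSplitter {
    // Splits the input at every ';' that lies outside a single-quoted literal.
    static List<String> split(String input) {
        List<String> statements = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inString = false;
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            current.append(c);
            if (c == '\'') {
                // An escaped quote ('') toggles the flag twice, so the state
                // after the pair is the same as before it.
                inString = !inString;
            } else if (c == ';' && !inString) {
                statements.add(current.toString().trim());
                current.setLength(0);
            }
        }
        // Trailing text without a terminating ';' is ignored.
        return statements;
    }

    public static void main(String[] args) {
        for (String s : split("SELECT 'a;b' FROM t; UPDATE t SET n = 'it''s';")) {
            System.out.println(s);
        }
    }
}
```

Unlike the line-anchored regular expressions, this does not depend on where line breaks fall, but it still only finds statement boundaries; it does not check that the text between them is valid SQL.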

SQL is complicated enough that you will need context to find all statements, meaning that you can't do this with a regular expression.

For example:

SELECT Model FROM Product
WHERE ManufacturerID IN (SELECT ManufacturerID FROM Manufacturer 
WHERE Manufacturer = 'Dell')

(The example comes from http://www.sql-tutorial.com/sql-nested-queries-sql-tutorial/.) Queries can be nested multiple times, start with different values, etc. Even if you could write a regular expression for the subset you are interested in, it would be unreadable.
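
One way to see the kind of context involved: finding where a subquery ends means counting parentheses, and the count can grow without bound. The sketch below tracks that depth with a plain counter. The class name NestingDepth and the method maxDepth are hypothetical, and the code only illustrates the bookkeeping; it is not a SQL parser.

```java
// Hypothetical helper, shown only to illustrate the state a parser must keep.
class NestingDepth {
    // Returns the deepest parenthesis nesting level, ignoring parentheses inside
    // single-quoted literals, or -1 if the parentheses are unbalanced.
    static int maxDepth(String sql) {
        int depth = 0;
        int max = 0;
        boolean inString = false;
        for (char c : sql.toCharArray()) {
            if (c == '\'') {
                inString = !inString;
            } else if (!inString && c == '(') {
                depth++;
                max = Math.max(max, depth);
            } else if (!inString && c == ')') {
                depth--;
                if (depth < 0) {
                    return -1; // closing parenthesis without a matching opener
                }
            }
        }
        return depth == 0 ? max : -1;
    }

    public static void main(String[] args) {
        String nested = "SELECT Model FROM Product\n"
            + "WHERE ManufacturerID IN (SELECT ManufacturerID FROM Manufacturer\n"
            + "WHERE Manufacturer = 'Dell')";
        System.out.println(maxDepth(nested));
    }
}
```

For the nested query above the depth is 1. A parser generated from a grammar keeps track of this kind of nesting as part of recognizing the statement structure.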

ANTLR has a SQL 2003 grammar available (I haven't tried it).
