Regular Expression Filter Help

About This Help
This is a brief explanation of the Regular Expression filter type. For more detailed instructions and usage examples, please see the user manual.

Use
A Regular Expression filter is used to search, modify, and replace text using Perl-compatible regular expressions. A regular expression is a pattern for matching complex and variable text and, optionally, for parsing the matched text into parts which can be used in later processing.

Use Example
Let's say you wanted to find every line with an equal sign and flip the text on either side of the equal sign. So, for example, A = B would become B = A. The following regular expression matches any line where an equal sign separates the line into two and, further, captures the parts of the line (left side, equal sign, right side) so that they can be referred to as \1, \2, and \3:
([^\r\n]+)( = )([^\r\n]+)

The following replace pattern re-arranges the parts of the line:
\3\2\1

If you wanted to flip the sides and convert the = sign to !=, you could use the following replace pattern:
\3 != \1
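
If you want to experiment outside TextSpresso, the same find and replace patterns can be tried in Python. This is only a sketch: Python's re module is not the PCRE library that TextSpresso uses, although it handles this particular pattern and the \1-style references in the same way. The name flip_sides and the sample text are made up for the example.

```python
import re

# Find pattern from the example above: left side, " = ", right side.
FIND = r'([^\r\n]+)( = )([^\r\n]+)'

def flip_sides(text, replace=r'\3\2\1'):
    """Swap the text on either side of ' = ' on every line (illustrative helper)."""
    return re.sub(FIND, replace, text)

sample = "A = B\nwidth = height"
print(flip_sides(sample))                # the sides of each line are swapped
print(flip_sides(sample, r'\3 != \1'))   # swapped, with = changed to !=
```

Because [^\r\n] cannot match a line break, each match stays within a single line.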

How It Works
Regular expressions are composed according to the syntax listed below. The regular expression syntax is both rich and complex, and there are numerous references available which describe complex operations using regular expressions.

Pattern       Description
.             Matches any character except newline.
[a-z0-9]      Matches any single character in the set.
[^a-z0-9]     Matches any single character not in the set.
\d            Matches a digit. Same as [0-9].
\D            Matches a non-digit. Same as [^0-9].
\w            Matches an alphanumeric (word) character. Same as [a-zA-Z0-9_].
\W            Matches a non-word character. Same as [^a-zA-Z0-9_].
\s            Matches a whitespace character (space, tab, newline, etc.).
\S            Matches a non-whitespace character.
\n            Matches a newline (line feed).
\r            Matches a return.
\t            Matches a tab.
\f            Matches a form feed.
\b            Matches a backspace (inside [] only).
\0            Matches a null character.
\000          Also matches a null character because of the following:
\nnn          Matches an ASCII character of that octal value.
\xnn          Matches an ASCII character of that hexadecimal value.
\cX           Matches an ASCII control character.
\metachar     Matches the meta-character (e.g., \, ., |).
(abc)         Used to create subexpressions. Remembers the match for later backreferences. Referenced by replacement patterns that use \1, \2, etc.
\1, \2,…      Matches whatever the first (second, and so on) set of parentheses matched.
x?            Matches 0 or 1 x's, where x is any of the above.
x*            Matches 0 or more x's.
x+            Matches 1 or more x's.
(x+?)         Turns greediness off so that the minimum number of x's is matched before moving to the next part of the pattern.
x{m,n}        Matches at least m x's, but no more than n.
abc           Matches all of a, b, and c in order.
a|b|c         Matches one of a, b, or c.
\b            Matches a word boundary (outside [] only).
\B            Matches a non-word boundary.
^             Anchors match to the beginning of a line or string.
$             Anchors match to the end of a line or string.
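
A few of the entries above are easier to see in action. The following is a minimal sketch in Python, whose re module is an assumption here (it is not the PCRE library TextSpresso uses, but it agrees with PCRE on these constructs). The sample strings are invented for illustration.

```python
import re

# Greedy: .+ grabs as much as possible, so both tags end up in one match.
greedy = re.findall(r'<.+>', '<a><b>')

# Non-greedy: .+? stops at the first > it can reach.
lazy = re.findall(r'<.+?>', '<a><b>')

# \b anchors at word boundaries, so "cat" inside "concat" or "cats" is skipped.
words = re.findall(r'\bcat\b', 'cat concat cats')

# x{m,n}: runs of at least two and at most three digits.
runs = re.findall(r'\d{2,3}', '1 22 4444')

# \1 inside the find pattern: the same text twice, separated by a space.
doubled = re.search(r'(\w+) \1', 'this is is a test')

print(greedy, lazy, words, runs, doubled.group(1))
```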

In addition, TextSpresso supports the following syntax in regular expressions for looking both ahead and behind a match.

Pattern         Comment
(?:pattern)     For grouping without creating backreferences.
(?=pattern)     A zero-width positive look-ahead assertion. For example, \w+(?=\t) matches a word followed by a tab, without including the tab in the match.
(?!pattern)     A zero-width negative look-ahead assertion. For example, foo(?!bar) matches any occurrence of "foo" that isn't followed by "bar".
(?<=pattern)    A zero-width positive look-behind assertion. For example, (?<=\t)\w+ matches a word that follows a tab, without including the tab in the match. Works only for fixed-width look-behind.
(?<!pattern)    A zero-width negative look-behind assertion. For example, (?<!bar)foo matches any occurrence of "foo" that does not follow "bar". Works only for fixed-width look-behind.
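
The four assertions can be checked with a short script. This sketch uses Python's re module as a stand-in (an assumption; TextSpresso itself uses PCRE), which supports the same look-ahead syntax and, like the table above, only fixed-width look-behind. The sample strings are made up.

```python
import re

# Positive look-ahead: words followed by a tab; the tab is not part of the match.
before_tab = re.findall(r'\w+(?=\t)', 'one\ttwo three\t')

# Negative look-ahead: "foo" not followed by "bar".
not_before_bar = re.findall(r'foo(?!bar)', 'foobar foobaz foo')

# Positive look-behind: words that follow a tab.
after_tab = re.findall(r'(?<=\t)\w+', 'one\ttwo three')

# Negative look-behind: "foo" not preceded by "bar".
not_after_bar = re.findall(r'(?<!bar)foo', 'barfoo bazfoo foo')

print(before_tab, not_before_bar, after_tab, not_after_bar)
```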

At this time, TextSpresso supports only the \nn and \name syntax for specifying parts of a pattern to be used in the replace. In addition, you can specify the entire match with the & character. By adding an extra \ in front, you can include a literal '&', '\nn', or '\name'. For example, given the find pattern (a+)(b)(c) matched against the text aaabc:

Pattern       Result         Explanation
\1\3          aaac           Builds the replace from the 1st and 3rd parts of the found text.
\1 & \3       aaa aaabc c    Includes two spaces and the entire match in the middle.
\\1\&\2       \1&b           The \ is used to escape \1 and & so that they are included literally. \2 inserts the text matched by the second part of the pattern.
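
To make the rules in the table concrete, here is a rough sketch of how such a replace pattern could be expanded. It is not TextSpresso's implementation: the helper names expand_replace and find_and_replace are hypothetical, Python's re module stands in for PCRE, and only \nn, &, \\ and \& are modeled (the \name form and references to parts that do not exist are left out).

```python
import re

# One token of a replace pattern: \nn (part reference), \\ or \& (escape), or a bare &.
_TOKEN = re.compile(r'\\(\d{1,2})|\\([\\&])|(&)')

def expand_replace(replace_pattern, match):
    """Build the replacement text for one match (hypothetical helper)."""
    def expand(token):
        number, escaped, whole = token.groups()
        if number is not None:
            return match.group(int(number))   # \1, \2, ... -> that part of the match
        if escaped is not None:
            return escaped                    # \\ -> \   and   \& -> &
        return match.group(0)                 # bare & -> the entire match
    return _TOKEN.sub(expand, replace_pattern)

def find_and_replace(find_pattern, replace_pattern, text):
    """Replace every match of find_pattern using the rules sketched above."""
    return re.sub(find_pattern, lambda m: expand_replace(replace_pattern, m), text)

# The three rows of the table, applied to the text aaabc.
for pattern in (r'\1\3', r'\1 & \3', r'\\1\&\2'):
    print(pattern, '->', find_and_replace(r'(a+)(b)(c)', pattern, 'aaabc'))
```

In this sketch \\1 yields a literal \1 because \\ is consumed first as an escaped backslash, leaving the 1 as ordinary text.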

How To Build It
Enter the text you want to find in the Find Pattern field. The search text must conform to the regular expression syntax described above. That syntax lets you enter a character by its octal or hexadecimal code, and in TextSpresso you can also use the String Code Editor to enter UTF-8 characters.
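
As a small illustration of entering characters by code, the sketch below uses Python's re module (an assumption; it is not TextSpresso's engine, but it accepts the same \xnn and three-digit octal forms). The sample strings are invented.

```python
import re

by_hex = re.findall(r'\x41', 'ABCA')      # hexadecimal 41 is "A"
by_octal = re.findall(r'\101', 'ABCA')    # octal 101 is also "A"
tabs = re.sub(r'\x09', ' ', 'a\tb')       # hexadecimal 09 is a tab
nulls = re.sub(r'\000', '', 'a\x00b')     # octal 000 is the null character

print(by_hex, by_octal, repr(tabs), repr(nulls))
```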

Enter the replacement text in the Replace Pattern field. Note that while regular expressions are fairly standard across software, replace patterns tend to vary in syntax and functionality. At this time TextSpresso only supports the syntax and features specified above.

If you will not be using the match or any of its parts in the replacement, you can speed up the operation by checking "Simple replace?". Simple replace uses the same optimized replace engine as TextSpresso's other filters and can process large amounts of text much faster. With simple replace turned on, the replace pattern supports the following:

Pattern       Comment
*             Retains the character (not byte) in the same position from the matched text. No character if there is no matched text in the same numerical position.
\*            Includes a literal * in the replacement text.
\\            Includes a literal \ in the replacement text.
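
One way to read the * rule is sketched below. This is a guess at the behavior, not TextSpresso's code: it assumes that "position" means the character's position in the replacement being built, with each escape (\* or \\) counting as one position. The function name simple_replace is hypothetical.

```python
def simple_replace(replace_pattern, matched):
    """Expand a simple replace pattern against one matched string (assumed rules)."""
    out = []
    i = 0      # index into the replace pattern
    pos = 0    # character position in the replacement being built
    while i < len(replace_pattern):
        ch = replace_pattern[i]
        nxt = replace_pattern[i + 1] if i + 1 < len(replace_pattern) else ''
        if ch == '\\' and nxt in ('*', '\\'):
            out.append(nxt)               # \* -> literal *, \\ -> literal \
            i += 2
        elif ch == '*':
            if pos < len(matched):
                out.append(matched[pos])  # keep the character at the same position
            i += 1                        # nothing is added if the match is too short
        else:
            out.append(ch)                # ordinary text is copied as is
            i += 1
        pos += 1
    return ''.join(out)

print(simple_replace('X*Z', 'abc'))   # keeps the middle character of the match
```

Python strings are indexed by character rather than by byte, which is in keeping with the "character (not byte)" wording above.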

About The Text Fields
Text fields for text used by the filter are specially designed to support the display and editing of all characters. This includes control characters which are not editable in a normal field. Control characters are displayed as UTF codes surrounded by dashes. For example, a line feed character is displayed as "-UTF10-".
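
The display convention can be imitated in a few lines. This is only an approximation: it assumes the number shown is the decimal code (as in "-UTF10-" for a line feed) and that "control character" means codes below 32 plus 127. The function name show_controls is made up for the example.

```python
def show_controls(text):
    """Render control characters as -UTFnn- the way the text fields display them."""
    out = []
    for ch in text:
        code = ord(ch)
        if code < 32 or code == 127:      # assumed definition of a control character
            out.append(f"-UTF{code}-")
        else:
            out.append(ch)
    return "".join(out)

print(show_controls("first line\nsecond\tline"))
```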

You can precisely insert, view, and edit any character in Unicode by clicking the label above a field. This will display the String Code Editor. The String Code Editor displays all characters in the field and their UTF codes in a grid. Using this grid you can type characters directly or indirectly by entering Unicode values. By displaying and allowing you to edit the UTF codes, the String Code Editor makes it possible to include and edit characters which cannot be displayed in the System font and/or cannot normally be displayed (i.e. control characters).

Regular Expressions vs. Pattern Filters
Regular expressions and TextSpresso patterns can both match complex, variable text. TextSpresso patterns were originally designed to be easy for a novice to understand, predict, enter, and use. They therefore have a simple syntax with no characters that have a special meaning, and they are edited using a graphical editor. Novice users don't have to remember any special rules or look up specific syntax to build their patterns.

Regular expressions are the opposite in that they have a complex syntax where many characters have a special or double meaning, and they are entered without the assistance of a graphical editor. If you're used to the regular expression syntax, you can enter a pattern faster by typing it than by using a TextSpresso pattern and its graphical editor. But even expert users are sometimes unable to predict exactly how a regular expression will match.

From a functionality standpoint, regular expressions can do things that normal TextSpresso patterns cannot, such as parsing a match into subparts and using those parts in the replacement. But TextSpresso patterns are often faster for searching and replacing, though this is only noticeable with large amounts of text. TextSpresso patterns are also used by some filters, such as the Sort filter type.

You should feel free to use whatever filter type/pattern matching you feel most comfortable with. That's why TextSpresso offers both engines.

Notes
For regular expression searching, TextSpresso uses a modified version of the PCRE library, which is open source software written by Philip Hazel and copyrighted by the University of Cambridge, England. You can learn more about PCRE and download the source code at: http://www.pcre.org/

Special thanks to Philip Hazel and the University of Cambridge for making this library available as open source under the BSD license.