Regular Expression Crash Course for eBooks

The eBook Design and Development Guide

For complete access to all the templates, tips, and tricks that BB eBooks uses for eBook production, please consider buying the eBook Design and Development Guide at Amazon for only $6.99. In it you will find comprehensive HTML, CSS, and Regular Expression tutorials, as well as a step-by-step workflow for turning a sloppy manuscript into a beautiful eBook that is only available in this guide. A PDF version is available upon request following purchase.


What are Regular Expressions?

When working with large amounts of text, such as the HTML source for an eBook, it is advantageous to be able to find and replace specific strings of text in your document. For example, if you wanted to replace every single straight up-and-down apostrophe (i.e. ') with a fancy apostrophe (i.e. ’), you could use a simple “find and replace” tool like the one in Microsoft Word or Open Office to get the job done. However, what if you wanted to use a fancy apostrophe only for apostrophes that are within a word (e.g. don’t, can’t, etc.)? Hmm… this would not be possible using a simple find and replace tool. You might consider cycling through every single apostrophe and checking to see if it should be changed to a fancy apostrophe. However, this would be time-consuming, and you would most likely make a mistake or two along the way.

A solution that can find and replace strings utilizing logic is regular expressions. These are powerful tools to ensure that you can find and replace specific pieces of markup and content in your HTML source file. The three main benefits of using regular expressions as part of your workflow in eBook design are as follows:

  • Save time
  • Make fewer mistakes
  • Prevent repetitive boredom and insanity

Regular expressions, or “regexes”, have been used by programmers for years as part of the development process for user interfaces and for coding. They are readily available in most text editor programs, such as Notepad++, so you don’t need any special software to use them. However, regular expressions and their documentation can be extremely complicated, since they are almost exclusively part of the geek community. Trying to read the Notepad++ documentation on regular expressions, for example, can be a challenge unless you have a background in computer science, programming, and/or web development. Therefore, this guide will look at the basics of regular expressions and try to apply them to some practical eBook development solutions. We will avoid going too far into nerd territory, although you should feel free to explore more about regular expressions, since they are incredibly powerful and underutilized.

Important Note: This guide will use Notepad++ to explain regular expressions, but you probably have regex support if you are using another text editor that is of suitable quality for eBook production.


The Find and Replace Window

If you’ve been working on designing your eBook, you have probably used the Find and Replace Window already. You can access it by pressing CTRL+F to just find text or CTRL+H to find and replace text. To get started with regular expressions, make sure that the radio button labeled Regular expression is checked in the Find Window or the Find and Replace Window.

Normally when you use the Find Window, you type in a character or word in the Find what: field and click Find Next. The text editor then goes through the document looking for exactly what you typed into the find field. You can also press F3 to find the next match, or SHIFT+F3 to find the previous match.

For our regular expression tutorial, let’s say you have the following text document on one line in your text editor:

Poppa took the apple to Paul. The Pope placed it in the box labeled 4.

Hypothetically, you want to search for the letter p. You type p in the Find what: field, click Find Next, and it cycles through all the “p”s in this line of text, selecting them one by one:

Poppa took the apple to Paul. The Pope placed it in the box labeled 4.

Matches: all nine p’s (three in Poppa, two in apple, one in Paul, two in Pope, and one in placed).

This is pretty shake and bake stuff, and you probably already know this. So let’s start learning about how regular expressions actually work.


Multiplying Operators: ?, +, and *

Multiplying operators are special characters reserved for regular expressions that instruct the text editor to find zero, one, or multiples of the preceding character. For example, if you wanted to search for p or pp at the same time, you can use the multiplying operator ? to create a regular expression. The ? operator means that the character immediately before it may appear in the matching text zero or one times. So by typing pp? in the find field, the text editor will match any p or pp combination.

You can use the multiplying operator ? in any part of the regular expression. For example, if you typed po?p in the find field, your text editor would match pp and pop in the sample text:

Poppa took the apple to Paul. The Pope placed it in the box labeled 4.

Matches: Pop (in Poppa), pp (in apple), and Pop (in Pope).

If you want to search for p-character-p, where character has to be in the text one or more times, you can use the multiplying operator +. Therefore, if you typed po+p in the find field, your text editor would match pop, poop, pooop, and so on, but not pp. Below is an example of what your text editor would match by typing po+p in the find field:

Poppa took the apple to Paul. The Pope placed it in the box labeled 4.

Matches: Pop (in Poppa) and Pop (in Pope).

The multiplying operator * is used in a similar fashion to ? and +; however, it instructs the preceding character to be matched by the text editor zero or any number of times. Therefore, if you typed po*p in the find field, it would match pp, pop, poop, pooop, etc. Below is an example with the matches listed:

Poppa took the apple to Paul. The Pope placed it in the box labeled 4.

Matches: Pop (in Poppa), pp (in apple), and Pop (in Pope).

Tip: For the preceding examples, we used case-insensitive regular expression matching. To turn on case-sensitive matching, click on the Match case checkbox in Notepad++.

If you want to match a character a specified number of times, you can use curly braces notation immediately after a character with a lower and upper limit. For example, typing in the regular expression po{3,4}p would match pooop and poooop only.

The multiplying operators and their curly braces equivalents are as follows:

  • x? is equivalent to x{0,1} and matches x zero or one times
  • x+ is equivalent to x{1,} and matches x one or more times
  • x* is equivalent to x{0,} and matches x zero or more times
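
As a rough cross-check outside the text editor, the same operators can be tried with Python's re module. This is only a sketch: Python is not part of the Notepad++ workflow described here, re.IGNORECASE stands in for leaving Match case unticked, and the second test string is made up for illustration.

```python
import re

text = "Poppa took the apple to Paul. The Pope placed it in the box labeled 4."

def find(pattern, s=text):
    """Return every non-overlapping match, ignoring case."""
    return re.findall(pattern, s, flags=re.IGNORECASE)

optional = find(r"po?p")    # zero or one "o" between the p's
one_plus = find(r"po+p")    # one or more "o"
zero_plus = find(r"po*p")   # zero or more "o"

# Hypothetical string used only to exercise the {3,4} limits.
bounded = find(r"po{3,4}p", "pop poop pooop poooop pooooop")

print(optional, one_plus, zero_plus, bounded)
```

On the sample sentence, po?p and po*p find the same three matches because no word contains more than one o between two p's.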

Important Note: Some text editors utilize a different syntax for regular expressions; however, the principle is essentially the same, even for higher-level programming languages like JavaScript and Python. Consult your help menu for syntax details if you are not using Notepad++.


Matching Groups of Characters and Special Characters

Rather than typing in combinations of specific characters in the find field, it can be advantageous to search for different groups of characters, or any character, all with one regular expression. To search for any character (a letter, digit, space, and everything else) you can type . in the find field. If you just type a . by itself, it will go one by one through every character in your text. This is not very useful. However, if you type p.p in the find field, this regular expression will match p-any character-p such as p1p, p p, pdp, etc. Trying to use the regular expression p.p on our example will match the following:

Poppa took the apple to Paul. The Pope placed it in the box labeled 4.

Matches: Pop (in Poppa) and Pop (in Pope).

If you want to match any number of characters between a p and an a, you can type the regular expression p.*a. However, this may give you strange results, because most text editors (including Notepad++) exhibit what is called greedy behavior. This means the .* part of the regular expression will stretch as far as it can along the line, so the match ends at the last a on the line rather than at the first a after the p.

Typing p.*a in the find field matches the following longer than expected match:

Poppa took the apple to Paul. The Pope placed it in the box labeled 4.

Match: everything from the P in Poppa through the a in labeled.

To limit how far the * multiplying operator will search during the find process, you can add the ? multiplying operator on the end of the .*. This limits the behavior of the * so that after finding a p, the .* will stop at the first a that it sees in your text.

Typing in p.*?a in the find field will have the following three matches:

Poppa took the apple to Paul. The Pope placed it in the box labeled 4.

Matches: Poppa, then pple to Pa, then Pope pla.
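
The difference between greedy and non-greedy matching is easy to see side by side. The following is a minimal sketch using Python's re module as a stand-in for the editor (an assumption, since this guide works in Notepad++), with re.IGNORECASE mimicking an unticked Match case box.

```python
import re

text = "Poppa took the apple to Paul. The Pope placed it in the box labeled 4."

# Greedy: .* runs to the end of the line, then backs up to the last "a".
greedy = re.findall(r"p.*a", text, flags=re.IGNORECASE)

# Non-greedy: .*? stops at the first "a" that follows each "p".
lazy = re.findall(r"p.*?a", text, flags=re.IGNORECASE)

print(greedy)
print(lazy)
```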

Occasionally, it may be necessary to match special characters reserved for regular expression functionality within your text editor (e.g. *, ?, ., etc.). If you want to actually find a period (i.e. .), you need to place a backslash in front of the . by typing \.. The backslash tells the text editor that the following character is to be matched literally and not used as the any-character equivalent.

By typing \. in the find field, you will match both periods in our example as follows:

Poppa took the apple to Paul. The Pope placed it in the box labeled 4.

Matches: the period after Paul and the period after 4.

The following special characters all need to be escaped with a backslash (i.e. \) if you want to match them literally in your text editor: ^, $, (, ), ., |, ?, +, *, [, ], {, }, -, and \. These characters all have special meaning to regular expressions, so if you want to find all the carets in your text, you would type \^ in the find field, as an example.
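
To see the effect of escaping, here is a small sketch in Python's re module (an assumption; the editor behaves the same way for these patterns). The re.escape helper is a Python convenience that adds the backslashes for you and has no direct equivalent in the find field.

```python
import re

text = "Poppa took the apple to Paul. The Pope placed it in the box labeled 4."

any_char = re.findall(r".", text)    # unescaped: matches every character
periods = re.findall(r"\.", text)    # escaped: matches only literal periods

# re.escape builds a pattern that matches the given string literally.
escaped_pattern = re.escape("4.")
literal = re.findall(escaped_pattern, text)

print(len(any_char), periods, escaped_pattern, literal)
```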

In some cases, you may want to match types of characters like whitespace, digits, and lower-case letters in your text. Rather than writing out different regular expressions for each specific character, there are reserved characters that can perform this function in one pass.

For example, if you wanted to find any digit, you could use the character \d. This will find any digit 0 through 9. Try typing \d in the find field to locate the lone digit in our example:

Poppa took the apple to Paul. The Pope placed it in the box labeled 4.

Match: the 4 at the end of the line.

A list of other kinds of characters that can be utilized as part of your regular expression arsenal is as follows:

  • \d – matches a digit
  • \D – matches a character that is not a digit
  • \l – matches a character that is a lower-case letter
  • \L – matches a character that is not a lower-case-letter
  • \u – matches a character that is an upper-case letter
  • \U – matches a character that is not an upper-case letter
  • \w – matches a character that is a digit, letter, or underscore
  • \W – matches a character that is not a digit, letter, or underscore
  • \t – matches a tab
  • \n – matches a line feed
  • \r – matches a carriage return
  • \s – matches a character that is used for spacing (tabs, new lines, and spaces)
  • \S – matches a character that is not used for spacing
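
Several of these classes can be tried in Python's re module, shown here as a hedged sketch rather than a description of Notepad++ itself. One difference to be aware of: Python's re does not accept \l or \u as character classes, so the sketch substitutes the sets [a-z] and [A-Z].

```python
import re

text = "Poppa took the apple to Paul. The Pope placed it in the box labeled 4."

digits = re.findall(r"\d", text)        # any digit
non_digits = re.findall(r"\D", text)    # anything that is not a digit
spaces = re.findall(r"\s", text)        # spacing characters
word_chars = re.findall(r"\w", text)    # digits, letters, underscore
upper = re.findall(r"[A-Z]", text)      # stand-in for \u
lower = re.findall(r"[a-z]", text)      # stand-in for \l

print(digits, upper, len(spaces), len(word_chars))
```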

Important Note: For end-of-line characters (aka EOL), Windows-based text editors use a carriage return followed by a line feed (i.e. \r\n), while Unix and Mac-based text editors use a line feed (i.e. \n). These characters are important so that the text editor knows where lines end.


Sets

Sets are a way to search for a group of characters in one pass that may not be covered by the reserved characters for groups discussed above. Hypothetically, if you wanted to search for the letter p or i, you would have to use the find function twice. However, if you use sets as part of your regular expression, you can make the text editor look for p or i in one pass.

The brackets [ and ] enclose what is known as a set, and you can put any combination of letters, numbers, and special characters in a set. For example, try typing [ip] in the find field. The text editor will go through the document and match every i and p as follows:

Poppa took the apple to Paul. The Pope placed it in the box labeled 4.

Matches: all nine p’s, plus the i in it and the i in in.

You can also use sets to provide a range of numbers or letters by using the special character - (i.e. the hyphen). For example, typing [a-c] as your regular expression will find the letters a, b, and c as shown in the following example:

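Poppa took the apple to Paul. The Pope placed it in the box labeled 4.

Matches: the a in Poppa, apple, and Paul; the a and c in placed; the b in box; and the a and b in labeled.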

If you want to search for characters not in a set, you add the special character ^ (i.e. the caret) before the characters within the set’s brackets. For example, try typing [^a-z] in the find field. The text editor will go through the document and match the following:

Poppa took the apple to Paul. The Pope placed it in the box labeled 4.

Matches: every space, both periods, and the digit 4.

Tip: If you want to match uppercase characters for the regular expression [^a-z], you can tick the Match case checkbox in the find window.

If you wanted to not match the spaces in your text editor, you could include the reserved character \s as part of the set in your regular expression. For example, typing [^a-z\s] in the find field will match the following:

Poppa took the apple to Paul. The Pope placed it in the box labeled 4.

Matches: both periods and the digit 4.
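
The set examples above can be sketched in Python's re module, used here only as an assumed stand-in for the editor. Passing re.IGNORECASE corresponds to leaving Match case unticked; omitting it corresponds to ticking the box.

```python
import re

text = "Poppa took the apple to Paul. The Pope placed it in the box labeled 4."
I = re.IGNORECASE

i_or_p = re.findall(r"[ip]", text, flags=I)            # every i and p
a_to_c = re.findall(r"[a-c]", text, flags=I)           # range a through c
no_letters = re.findall(r"[^a-z]", text, flags=I)      # anything but a letter
no_letters_or_space = re.findall(r"[^a-z\s]", text, flags=I)

# Case-sensitive: capital letters now fall outside a-z and are matched too.
case_sensitive = re.findall(r"[^a-z\s]", text)

print(len(i_or_p), a_to_c, no_letters_or_space, case_sensitive)
```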

Using Anchors

Anchors can be utilized in regular expressions to allow matches only in certain situations such as the beginning of a word or the beginning of a line. If you wanted to search for entire words that just started with the letter p, you could use the regular expression \<p. The \< instructs the text editor to only look for matches at the beginning of words. Likewise, if the \> is part of your regular expression, the text editor will only look for matches at the end of words.

Try typing the regular expression \<p in the find field, and the text editor will go through the document and match the following:

Poppa took the apple to Paul. The Pope placed it in the box labeled 4.

Matches: the first letter of Poppa, Paul, Pope, and placed.

Now, try typing e\> in the find field to match any word ending in e. The text editor will go through the document and find the following matches:

Poppa took the apple to Paul. The Pope placed it in the box labeled 4.

Matches: the final e of the, apple, The, Pope, and the.

To find any words that start with a p and end with an e, you may be inclined to type \<pe\>. This will only find the string pe. Instead, try typing \<p[a-z]*?e\> in the find field. It will match the following in our example:

Poppa took the apple to Paul. The Pope placed it in the box labeled 4.

Match: Pope.

Let’s analyze this regular expression by its parts. The \<p term is looking for any word that starts with p. Then, the [a-z]*? is looking for zero or more letters in a non-greedy pattern match. Finally, the e\> means the word must end with an e.

You may have noticed that when you search for words starting with p by using the regular expression \<p, the text editor only matched the p by itself. However, if you want to select an entire word that starts with the letter p, try typing \<p[a-z]* in the find window.

Poppa took the apple to Paul. The Pope placed it in the box labeled 4.

Matches: Poppa, Paul, Pope, and placed.

Say you want to get real crafty and find entire sentences. You could type the regular expression [a-z0-9 ]*?\.. This will look for any combination of letters, numbers, and spaces, and match characters until it hits the first . (i.e. the period). Don’t forget that the ? after the * ensures the regular expression has non-greedy behavior. Typing in [a-z0-9 ]*?\. will have two matches in our example as follows:

Poppa took the apple to Paul. The Pope placed it in the box labeled 4.

Matches: the first sentence (through Paul.) and the second sentence (through 4.).
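
The word-anchor examples can be approximated in Python's re module, with one caveat: Python treats \< and \> as literal angle brackets, so this sketch assumes \b (which matches at either edge of a word) as a substitute for both.

```python
import re

text = "Poppa took the apple to Paul. The Pope placed it in the box labeled 4."
I = re.IGNORECASE

word_starts = re.findall(r"\bp", text, flags=I)         # p at the start of a word
word_ends = re.findall(r"e\b", text, flags=I)           # e at the end of a word
p_to_e = re.findall(r"\bp[a-z]*?e\b", text, flags=I)    # words from p to e
p_words = re.findall(r"\bp[a-z]*", text, flags=I)       # whole words starting with p
sentences = re.findall(r"[a-z0-9 ]*?\.", text, flags=I)

print(p_to_e, p_words)
print(sentences)
```

Note that the second sentence match begins with the space that follows the first period, since a space is part of the set.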

However, not all sentences end in a period, because sometimes they end in a question mark or a closing quotation mark. Also, many sentences have apostrophes in them. You can adjust your regular expression to incorporate these different scenarios. Try using the regular expression [a-z0-9' ]*?[\.\?"] to match different sentences throughout your manuscript.

Tip: Regular expressions can get very complex very quickly when you try to account for different situations. Try to keep them as simple as possible and consider using multiple regular expressions on separate passes rather than one monstrous one that accounts for all scenarios.

Anchors can also be used to see if a condition exists at the beginning or end of a line. The ^ character specifies that the regular expression should only be applied at the beginning of a line, and the $ character specifies that the regular expression should only be applied at the end of a line. For example, the regular expression ^p would match the p in any line that started with p, and ^the would match the the in any line that started with the word the. Likewise, the regular expression \?$ would match the ? in any line that ended with a question mark.

For eBook design, you often want to select entire lines of text when you are marking up your content with HTML. To match every line of text one by one, try typing the following ^.+$:

Poppa took the apple to Paul. The Pope placed it in the box labeled 4.

Match: the entire line.

The ^ in the regular expression starts the match at the front of the line in the text editor. Likewise, the $ in the regular expression matches the end of the line in the text editor. The .+ matches one or more of any character (i.e. any line that is not blank). This is a useful way to grab entire lines of text, and you can use substitutions, which are discussed below, to quickly add HTML markup.
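
Line anchors can be sketched in Python's re module as well. This is an illustration under assumptions: the three-line manuscript is invented, and Python needs the re.MULTILINE flag before ^ and $ will match at every line rather than only at the start and end of the whole text.

```python
import re

# Hypothetical manuscript with a blank line in the middle.
manuscript = "Chapter One\n\nIt was a dark night.\nThe end."

# One or more of any character between line start and line end.
lines = re.findall(r"^.+$", manuscript, flags=re.MULTILINE)

# "the" only when it begins a line, ignoring case.
leading_the = re.findall(r"^the", manuscript, flags=re.MULTILINE | re.IGNORECASE)

# A question mark only when it ends a line.
trailing_q = re.findall(r"\?$", "Who is there?\nNobody.", flags=re.MULTILINE)

print(lines, leading_the, trailing_q)
```

The blank line is skipped because .+ requires at least one character.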

Important note: The ^ as an anchor should not be confused with the ^ inside of a set (e.g. [^a-m]), because within a set it means no characters between a and m rather than matching the beginning of a line. Yes, we know that this can be rather confusing.


Alternation and Assertions

If you wanted to match either the letter p or o, you have already learned about sets, and the simple regular expression [po] would be suitable. However, what if you wanted to match either Pope or Paul? This would not be possible utilizing a set, since they only look at single characters. As an alternative you can use the | character (i.e. the pipe character) to provide alternative expressions for possible matches with the syntax (expression 1)|(expression 2).

The regular expression to search for either Pope or Paul would be (pope)|(paul). This will match the following in our example:

Poppa took the apple to Paul. The Pope placed it in the box labeled 4.

Matches: Paul and Pope.

Important Note: Ensure that you use parentheses when grouping characters so the text editor knows how to process the regular expression.
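
Alternation behaves the same way in Python's re module, sketched below as an assumed stand-in for the editor. Because the pattern contains parentheses, re.findall would return one tuple per match, so the sketch uses re.finditer and asks each match for its full text instead.

```python
import re

text = "Poppa took the apple to Paul. The Pope placed it in the box labeled 4."

pattern = r"(pope)|(paul)"
matches = [m.group(0) for m in re.finditer(pattern, text, flags=re.IGNORECASE)]

print(matches)
```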

One complicated, but very useful, way to use regular expressions is by utilizing lookahead and lookbehind assertions. These allow you to apply powerful logic to your matches, although many text editors do not support this feature (Notepad++ only began doing so with its release in 2012). Every time a match is made during the process, you can use an assertion as part of your regular expression to provide a test of the characters after it (in the case of a lookahead assertion) or the characters behind it (in the case of a lookbehind assertion).

As an example, say that you wanted to match the letter p, but only if the following character was an o. The regular expression would be as follows p(?=o):

Poppa took the apple to Paul. The Pope placed it in the box labeled 4.

Matches: the P in Poppa and the P in Pope.

Important Note: Notice that the o is not actually part of each match. Assertions are only a form of logic and do not actually match characters.

If you wanted to match any p that had an a before it, you could try typing the regular expression (?<=a)p:

Poppa took the apple to Paul. The Pope placed it in the box labeled 4.

Match: the first p in apple.

It is also possible to use negative assertions, which means that a character will match only if a condition is not met. For example, if you wanted to match every p that did not have an o ahead of it, you could type the regular expression p(?!o) to match the following in our example:

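Poppa took the apple to Paul. The Pope placed it in the box labeled 4.

Matches: the second and third p in Poppa, both p’s in apple, the P in Paul, the second p in Pope, and the p in placed.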

Below is a summary of the syntax for assertions:

  • Lookahead: x(?=y) – Matches every x that has y ahead of it
  • Lookbehind: (?<=y)x – Matches every x that has y behind it
  • Negative Lookahead: x(?!y) – Matches every x that does not have y ahead of it
  • Negative Lookbehind: (?<!y)x – Matches every x that does not have y behind it
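
All four assertions can be compared in one pass. The following is a minimal sketch in Python's re module (an assumption; Notepad++ uses the same syntax for these). It records the position of each matching p, counting from zero, so it is clear that the neighboring letter is tested but never included in the match.

```python
import re

text = "Poppa took the apple to Paul. The Pope placed it in the box labeled 4."

def starts(pattern):
    """Start position of every match, ignoring case."""
    return [m.start() for m in re.finditer(pattern, text, flags=re.IGNORECASE)]

all_p = starts(r"p")
lookahead = starts(r"p(?=o)")         # p followed by o
lookbehind = starts(r"(?<=a)p")       # p preceded by a
neg_lookahead = starts(r"p(?!o)")     # p not followed by o
neg_lookbehind = starts(r"(?<!a)p")   # p not preceded by a

# Each assertion and its negative split the full set of p's between them.
print(sorted(lookahead + neg_lookahead) == all_p)
print(sorted(lookbehind + neg_lookbehind) == all_p)
```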

Substitutions

Up until now we have only covered how to find specific pieces of content utilizing regular expressions. However, an important part of eBook design is adding and replacing content as you work your way toward having an eBook with valid HTML markup. Sometimes you may just need to replace text you find with an exact replacement (e.g. replacing all tabs with a blank to remove them). However, you often will need to replace the text based on what was matched through the regular expression. An example would be wrapping paragraph tags (<p> and </p>) around each line of text.

Substitutions are an integral part of regular expressions, in that they instruct the text editor to replace content based on the text within the match itself. For example, say you have wrapped some italicized text with the 1xxx2 and 2xxx1 placeholder tags as follows:

1xxx2italics text2xxx1 not italics text 1xxx2more italics text2xxx1

To select the entire italics text and the placeholder tags wrapped around it, a perfectly acceptable regular expression would be 1xxx2.*?2xxx1. This regex would select the following two matches:

1xxx2italics text2xxx1 not italics text 1xxx2more italics text2xxx1

Matches: 1xxx2italics text2xxx1 and 1xxx2more italics text2xxx1.

To convert the placeholder tags into meaningful HTML markup, you need to replace the opening placeholder 1xxx2 with <span class="i"> and the closing placeholder 2xxx1 with </span>. However, every match has different content inside the placeholders. Therefore you need to use a substitution inside the replace regular expression to match the content within the placeholders of the match.

A substitution is declared in the find regular expression by using opening (i.e. () and closing parentheses (i.e. )) around any number of terms.

Try typing (1xxx2)(.*?)(2xxx1) in the find field. The text editor will match the exact same results. However, the text editor can now reference the expressions inside the parentheses with substitutions. \1 in the replace field represents the 1xxx2 match, \2 represents whatever is matched by .*?, and \3 represents whatever is matched by 2xxx1. Notice that the substitutions are referenced in order from left to right based on the parentheses in the find field.

Using (1xxx2)(.*?)(2xxx1) in the find field, try typing <span class="i">\2</span> in the replace field. Clicking Replace on each match will alter the text as follows:

<span class="i">italics text</span> not italics text <span class="i">more italics text</span>
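
The same find and replace can be sketched with re.sub in Python's re module, offered here as an illustration rather than as part of the Notepad++ workflow. Python happens to use the same \1, \2, \3 notation in its replacement string.

```python
import re

source = "1xxx2italics text2xxx1 not italics text 1xxx2more italics text2xxx1"
pattern = r"(1xxx2)(.*?)(2xxx1)"

# What each pair of parentheses captured, match by match.
groups = re.findall(pattern, source)

# Keep only the second group and wrap it in the span markup.
result = re.sub(pattern, r'<span class="i">\2</span>', source)

print(groups)
print(result)
```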

This is a powerful way to mark up your eBook’s HTML. However, it takes some getting used to, so don’t feel discouraged if it is not sinking in yet. In the next section we will take a look at practical regular expressions for eBook designers that will help you make short work of eBook design.


Practical Regular Expressions for eBook Design

Important Note: These expressions assume that you are using Windows encoding in your text editor, which marks end of lines as \r\n. If you are using Mac or Unix encoding, the end of lines are simply \n. If you are using a really, really old Mac, it might actually be \r.

Turn Double Spaces into Single Space

Find: [space][space]+

Replace: [space]

Explanation: Please note the spaces in the regular expressions. The find regular expression matches any text that has two or more spaces. The replace changes the match to a single space. This is useful for old-school writers who can’t break the habit of double-spacing after periods.

Delete All Tabs

Find: \t

Replace: [blank]

Explanation: Tabs can create problems in eBooks based on the different ways they are rendered by various eReading devices. They must be stripped. Recall that \t is a special character that matches any tab. By leaving the replace field blank, you are essentially deleting all tabs throughout the entire document in one fell swoop.

Delete Blank Lines (Windows encoding)

Find: \r\n\r\n (you should click on the Extended button in the find/replace window)

Replace: \r\n

Explanation: Every line has a \r\n (or \n for Mac/Unix) at the end of it. Therefore, two in a row makes a blank line. This regular expression will replace any blank line. Run this repeatedly until you get no matches. For Mac/Unix users you would find \n\n and replace with a single \n.

Trim Leading Whitespace

Find: ^ +

Replace: [blank]

Explanation: The ^ anchor of the regular expression begins at the start of the line and looks for any spaces. By replacing these matches with a blank, you are deleting all spaces at the beginning of every line.

Trim Trailing Whitespace

Find:  +$

Replace: [blank]

Explanation: The $ anchor of the regular expression ends at the end of the line and looks for any spaces before it. By replacing these matches with a blank, you are deleting all spaces at the end of every line.
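
The whitespace recipes so far can be chained into one pass. Below is a hedged sketch in Python's re module: the function name clean_whitespace and the sample text are hypothetical, and the sketch first converts Windows line endings to \n (an assumption made because Python's $ only recognizes \n as a line ending), then collapses any run of blank lines in one step instead of repeating the replacement.

```python
import re

def clean_whitespace(text: str) -> str:
    """Apply the tab, space, trim, and blank-line recipes in sequence."""
    text = text.replace("\r\n", "\n")                  # assumption: normalize EOLs first
    text = re.sub(r"\t", "", text)                     # delete all tabs
    text = re.sub(r"  +", " ", text)                   # two or more spaces -> one
    text = re.sub(r"^ +", "", text, flags=re.MULTILINE)   # trim leading spaces
    text = re.sub(r" +$", "", text, flags=re.MULTILINE)   # trim trailing spaces
    text = re.sub(r"\n{2,}", "\n", text)               # delete blank lines
    return text

# Hypothetical sloppy manuscript fragment.
raw = "\tHello  world.   \r\n\r\n\r\n   Second   line.\r\n"
cleaned = clean_whitespace(raw)
print(repr(cleaned))
```

Trimming before deleting blank lines matters: a line holding only spaces becomes truly empty first, so the last step can remove it.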

Wrap Paragraph Tags on Every Line

Find: ^(.+)$

Replace: <p>\1</p>

Explanation: The ^ anchor starts at the beginning of the line. The (.+) finds one or more of any character (i.e. not a blank line). The $ anchor ends at the end of a line. The <p>\1</p> replace regular expression wraps the paragraph tags around the text of the line represented by the \1 substitution.

Add a Class to Section Breaks

Find: ^<p>(\*+)</p>$

Replace: <p class="centeredbreak">\1</p>

Explanation: The find regular expression looks for one or more * (i.e. ***, ****, etc.) wrapped inside paragraph tags. The replace regular expression replaces the leading <p> tag with <p class="centeredbreak"> to apply a CSS hook.
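
The paragraph-wrapping and section-break recipes work well as a pair, run in that order. Here is a minimal sketch in Python's re module; the three-line manuscript is invented for illustration, and it assumes \n line endings with re.MULTILINE so that ^ and $ apply to each line.

```python
import re

# Hypothetical cleaned manuscript: one paragraph per line, *** as a section break.
manuscript = "It was a dark night.\n***\nMorning came."

# Step 1: wrap every non-blank line in paragraph tags.
wrapped = re.sub(r"^(.+)$", r"<p>\1</p>", manuscript, flags=re.MULTILINE)

# Step 2: add the class to paragraphs that contain only asterisks.
marked = re.sub(
    r"^<p>(\*+)</p>$",
    r'<p class="centeredbreak">\1</p>',
    wrapped,
    flags=re.MULTILINE,
)

print(marked)
```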

Add an alt Attribute to All Images

Find: <img src="(.*?)" />

Replace: <img src="\1" alt="My Awesome Image" />

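Explanation: The find regular expression looks for img markup that contains only a src attribute. Note that the multiplying operator inside the src has non-greedy behavior to prevent the match from extending into other HTML markup that may have a " within it. Be aware that .*? can still stretch across an existing alt attribute if that is the only way to reach the closing " />, so you may prefer [^"]* in place of .*? for a stricter match. A substitution is used in the replace regular expression so that the src attribute is unchanged.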
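
It is worth testing this recipe on markup where some images already have an alt attribute. The sketch below uses Python's re module and an invented HTML fragment (both assumptions) to compare the .*? pattern with a stricter variant that uses the set [^"]*, which cannot cross a quotation mark.

```python
import re

# Hypothetical fragment: the second image already has an alt attribute.
html = '<img src="cover.jpg" /> <img src="map.png" alt="Map" />'
replacement = r'<img src="\1" alt="My Awesome Image" />'

# Non-greedy version from the recipe.
lazy = re.sub(r'<img src="(.*?)" />', replacement, html)

# Stricter variant: the captured text may not contain a quotation mark.
strict = re.sub(r'<img src="([^"]*)" />', replacement, html)

print(lazy)
print(strict)
```

With the .*? pattern the second image ends up with two alt attributes, while the stricter variant leaves it alone.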

Replacing & with &amp;

Find: &(?![#a-zA-Z0-9]+?;)

Replace: &amp;

Explanation: The find regular expression is looking for any & that does not already have HTML entity encoding ahead of it. You can use a negative lookahead assertion to see if the ampersand is part of an HTML entity or part of your text. This prevents repeatedly replacing & with &amp; if the ampersand is actually part of an HTML entity (e.g. &rsquo;).
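
A quick way to check that the negative lookahead protects existing entities is to run the replacement twice and confirm that nothing changes the second time. This is a sketch in Python's re module with a made-up test string, both of which are assumptions for illustration.

```python
import re

# Hypothetical text mixing bare ampersands with existing entities.
text = "Tom & Jerry &amp; friends &#8217; &rsquo; R&D"
pattern = r"&(?![#a-zA-Z0-9]+?;)"

once = re.sub(pattern, "&amp;", text)
twice = re.sub(pattern, "&amp;", once)   # second pass should change nothing

print(once)
print(once == twice)
```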

Tip: If you are looking for help with regular expressions, please drop us a line through the BB eBooks Developer page.

Replacing Straight Apostrophes with Fancy Apostrophes

Find: ([a-zA-Z0-9])'

Replace: \1’

Explanation: The find regular expression is looking for any letter or number that is immediately followed by a normal apostrophe. It is enclosed in parentheses to facilitate substitution in the replace regular expression with \1. The replace regular expression turns the normal apostrophe into a fancy (i.e. curled) apostrophe.
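
Here is the same recipe as a short sketch in Python's re module, with an invented sample line (both are assumptions for illustration). The curled apostrophe is written as \u2019 in the replacement so the code does not depend on how the file is saved.

```python
import re

# Hypothetical line with contractions, a possessive, and a quoted word.
line = "Don't say 'hello' if it's Tom's turn."

# \1 restores the letter or number; \u2019 is the curled apostrophe.
curled = re.sub(r"([a-zA-Z0-9])'", "\\1\u2019", line)

print(curled)
```

An apostrophe that follows a space, such as the opening quote before hello, is left straight because no letter or number precedes it.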
