A Practical Guide to Regular Expressions: From Basics to Web Scraping

March 24, 2025

Basic Syntax

Regular expressions are a powerful tool for matching patterns in text. The table below summarizes the basic syntax elements:

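Character	Description	Example	Matches
`.`	Any single character (except a line break)	`a.c`	abc, adc, a$c…
`\d`	Any digit	`\d{3}`	123, 456…
`\w`	A letter, digit, or underscore	`\w+`	abc, a123, test_123…
`\s`	A whitespace character (space, tab, line break)	`a\sb`	"a b", "a\tb"…
`^`	Start of the string	`^Hello`	"Hello world"
`$`	End of the string	`world$`	"Hello world"
`*`	The preceding pattern zero or more times	`a*b`	b, ab, aab…
`+`	The preceding pattern one or more times	`a+b`	ab, aab…
`?`	The preceding pattern zero or one time	`colou?r`	color, colour
`{n}`	The preceding pattern exactly n times	`a{3}`	aaa
`{n,}`	The preceding pattern at least n times	`a{2,}`	aa, aaa, aaaa…
`{n,m}`	The preceding pattern n to m times	`a{2,4}`	aa, aaa, aaaa
`|`	Alternation (logical OR)	`cat|dog`	cat, dog
`[]`	Character set; matches any one character inside the brackets	`[abc]`	a, b, c
`[^]`	Negated character set; matches any one character not inside the brackets	`[^abc]`	d, e, f…
`()`	Capturing group; the matched content can be extracted	`(abc)`	extracts "abc"
`(?:)`	Non-capturing group; the matched content is not extracted	`(?:abc)`	does not extract "abc"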

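Pattern Flags (Modifiers)

Flag	Description
`g`	Global matching (finds every match instead of only the first)
`i`	Case-insensitive matching
`m`	Multiline matching (`^` and `$` match at the start and end of each line)
`s`	Lets `.` match line breaks as well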
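
The effect of each flag is easiest to see side by side. The following is a minimal sketch in JavaScript; the sample string and variable names are made up for illustration.

```javascript
// Sample text (made up for illustration): three lines.
const text = "Cat\ncat\nCAT";

// Without `g`, match() returns only the first hit.
const first = text.match(/cat/);            // "cat" at index 4
// `g` returns every hit; `i` ignores case.
const all = text.match(/cat/gi);            // ["Cat", "cat", "CAT"]
// `m` lets ^ and $ match at each line boundary.
const lineStarts = text.match(/^cat$/gim);  // one match per line
// Without `s`, `.` does not match "\n"; with `s`, it does.
const dotNoS = /Cat.cat/.test(text);        // false
const dotWithS = /Cat.cat/s.test(text);     // true

console.log(first[0], all, lineStarts.length, dotNoS, dotWithS);
```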

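Common Techniques for HTML Extraction

Lazy matching: use .*? rather than .* (greedy matching)
Attribute matching: [^>]* matches any attributes inside a tag
Capturing groups: () extract the content you need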
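
To make the difference between greedy and lazy matching concrete, here is a small sketch that combines all three techniques. The HTML snippet is invented for the example.

```javascript
// Made-up HTML with two paragraphs on one line.
const html = '<p class="a">first</p><p class="b">second</p>';

// Greedy: .* runs on to the LAST </p>, swallowing both paragraphs.
const greedy = html.match(/<p[^>]*>(.*)<\/p>/)[1];
// Lazy: .*? stops at the FIRST </p>.
const lazy = html.match(/<p[^>]*>(.*?)<\/p>/)[1];
// With the g flag, matchAll yields the capture group of every paragraph.
const allParagraphs = [...html.matchAll(/<p[^>]*>(.*?)<\/p>/g)].map((m) => m[1]);

console.log(greedy);        // first</p><p class="b">second
console.log(lazy);          // first
console.log(allParagraphs); // [ 'first', 'second' ]
```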

Common Regular Expression Examples

Here are 20 practical regular expressions:

1. Email address

const emailRegex = /^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$/;

Example matches: john.doe@example.com, user123@gmail.co.uk

2. URL

const urlRegex = /^(https?:\/\/)?(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)$/;

Example matches: https://www.example.com, example.com/path?query=123

3. Phone number (international format)

const phoneRegex = /^\+?(\d{1,3})?[\s-]?\(?\d{1,4}\)?[\s-]?\d{1,4}[\s-]?\d{1,9}$/;

Example matches: +1 (123) 456-7890, 0912-345-678

4. IP address (IPv4)

const ipv4Regex = /^(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$/;

Example matches: 192.168.1.1, 127.0.0.1

5. Date (YYYY-MM-DD format)

const dateRegex = /^\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12][0-9]|3[01])$/;

Example matches: 2023-01-31, 2022-12-01
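
The date pattern in example 5 checks only the shape of the string, so it also accepts impossible dates such as 2023-02-31. The following is a minimal sketch of one way to add a calendar check on top of it, assuming the built-in `Date` object is good enough for the job. `isRealDate` is a hypothetical helper name, not part of the original examples.

```javascript
const dateRegex = /^\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12][0-9]|3[01])$/;

// Hypothetical helper: shape check via the regex, then a calendar check via Date.
function isRealDate(s) {
  if (!dateRegex.test(s)) return false;
  const [y, m, d] = s.split("-").map(Number);
  const dt = new Date(Date.UTC(y, m - 1, d));
  // Date rolls invalid days over (Feb 31 becomes Mar 3), so compare the parts back.
  // Note: Date.UTC maps years 0-99 to 1900-1999, so such years fail this comparison.
  return (
    dt.getUTCFullYear() === y &&
    dt.getUTCMonth() === m - 1 &&
    dt.getUTCDate() === d
  );
}

console.log(isRealDate("2023-01-31")); // true
console.log(isRealDate("2023-02-31")); // false: passes the regex, fails the calendar
console.log(isRealDate("2024-02-29")); // true: leap day
```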

6. Time (HH:MM:SS or HH:MM format)

const timeRegex = /^([01]?[0-9]|2[0-3]):([0-5][0-9])(?::([0-5][0-9]))?$/;

Example matches: 13:45:30, 09:05

7. Password strength (at least 8 characters, with lowercase and uppercase letters, a digit, and a special character)

const strongPasswordRegex = /^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*[@$!%*?&])[A-Za-z\d@$!%*?&]{8,}$/;

Example matches: Password123!, Secure@9876

8. Credit card number

const creditCardRegex = /^(?:4[0-9]{12}(?:[0-9]{3})?|5[1-5][0-9]{14}|3[47][0-9]{13}|6(?:011|5[0-9]{2})[0-9]{12})$/;

Example matches: 4111111111111111 (Visa), 5500000000000004 (MasterCard)

9. ZIP code (US)

const usZipCodeRegex = /^\d{5}(?:-\d{4})?$/;

Example matches: 12345, 12345-6789

10. Social Security number (US)

const ssnRegex = /^(?!000|666|9\d{2})\d{3}-(?!00)\d{2}-(?!0000)\d{4}$/;

Example matches: 123-45-6789

11. HTML tags

const htmlTagRegex = /<\/?[a-z][^>]*>/gi;

Matches: opening and closing tags, with or without attributes

12. Extract quoted content

const quotesContentRegex = /"([^"]*)"/g;

Example: extracts hello world from 'He said "hello world"'

13. Hexadecimal color code

const hexColorRegex = /^#?([a-fA-F0-9]{6}|[a-fA-F0-9]{3})$/;

Example matches: #FF0000, #F00, FF0000

14. Number range validation (0-100)

const numberRangeRegex = /^([0-9]|[1-9][0-9]|100)$/;

Example matches: 0, 42, 100

15. Username (starts with a letter; letters, digits, and underscores allowed; length 3-16)

const usernameRegex = /^[a-zA-Z][a-zA-Z0-9_]{2,15}$/;

Example matches: john_doe123, user_2023

16. Chinese characters

const chineseCharRegex = /[\u4e00-\u9fa5]/g;

Example matches: 你好世界

17. Extract the content of a <script> tag

const scriptContentRegex = /<script[^>]*>([\s\S]*?)<\/script>/gi;

Example: extracts console.log('hello'); from <script>console.log('hello');</script>

18. Extract the fields of a CSV line

const csvLineRegex = /(?:^|,)(?:"([^"]*(?:""[^"]*)*)"|([^,]*))/g;

Example input: 1,"John, Doe",30
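
The CSV pattern puts quoted fields in capture group 1 and unquoted fields in group 2, so using it takes a little glue code. A minimal sketch, with `parseCsvLine` as a hypothetical helper name:

```javascript
const csvLineRegex = /(?:^|,)(?:"([^"]*(?:""[^"]*)*)"|([^,]*))/g;

// Hypothetical helper: collect one field per match.
function parseCsvLine(line) {
  const fields = [];
  for (const m of line.matchAll(csvLineRegex)) {
    if (m[1] !== undefined) {
      // Quoted field: a doubled quote ("") stands for one literal quote.
      fields.push(m[1].replace(/""/g, '"'));
    } else {
      // Unquoted field (may be empty).
      fields.push(m[2]);
    }
  }
  return fields;
}

console.log(parseCsvLine('1,"John, Doe",30')); // [ '1', 'John, Doe', '30' ]
```

One limitation worth knowing: on a line that begins with a comma (an empty first field), the pattern produces a zero-length match at position 0 and the remaining fields are then missed, so this sketch is not a substitute for a real CSV parser.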

19. Remove extra whitespace

const removeExtraSpacesRegex = /\s{2,}/g;

Usage: "Hello    world".replace(removeExtraSpacesRegex, " ") returns "Hello world"

20. Digits-only validation

const digitsOnlyRegex = /^\d+$/;

Example matches: 12345, 67890

Web Scraping Case Studies

Extracting Google News Search Results

The following regular expressions extract Google News search results:

Title pattern

const titleRegex = /<h[3-4][^>]*>.*?<a[^>]*>(.*?)<\/a>.*?<\/h[3-4]>/gs;

Link pattern

const linkRegex = /<h[3-4][^>]*>.*?<a[^>]*?href="([^"]*)"[^>]*>.*?<\/a>.*?<\/h[3-4]>/gs;

Description pattern

const descriptionRegex = /<div class="[^"]*?st"[^>]*>(.*?)<\/div>|<span class="[^"]*?st"[^>]*>(.*?)<\/span>/gs;

Another place the description may appear

const altDescriptionRegex = /<div class="[^"]*?VwiC3b[^"]*"[^>]*>(.*?)<\/div>/gs;
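
As a rough illustration of how the title and link patterns can be used together, the sketch below runs them over a made-up HTML fragment. The markup is an assumption for the example only; real Google result pages look different and change often.

```javascript
const titleRegex = /<h[3-4][^>]*>.*?<a[^>]*>(.*?)<\/a>.*?<\/h[3-4]>/gs;
const linkRegex = /<h[3-4][^>]*>.*?<a[^>]*?href="([^"]*)"[^>]*>.*?<\/a>.*?<\/h[3-4]>/gs;

// Made-up markup for illustration only.
const sampleHtml = `
<h3 class="t"><a href="https://example.com/a">First headline</a></h3>
<h3 class="t"><a href="https://example.com/b">Second <b>headline</b></a></h3>`;

// Capture group 1 holds the anchor's inner HTML; strip any nested tags from it.
const titles = [...sampleHtml.matchAll(titleRegex)].map((m) =>
  m[1].replace(/<[^>]+>/g, "").trim()
);
const links = [...sampleHtml.matchAll(linkRegex)].map((m) => m[1]);

// Pairing by index assumes both patterns matched the same headings.
const results = titles.map((title, i) => ({ title, link: links[i] }));
console.log(results);
```

Pairing two independent match lists by position is fragile: if one pattern misses a heading, titles and links drift out of step, so it is worth checking that both lists have the same length.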

Extracting LinkedIn Job Search Results

The following regular expressions extract LinkedIn job search results:

Job title pattern

const jobTitleRegex = /<h3[^>]*class="[^"]*base-search-card__title[^"]*"[^>]*>.*?<a[^>]*>(.*?)<\/a>.*?<\/h3>/gs;

Company name pattern

const companyNameRegex = /<h4[^>]*class="[^"]*base-search-card__subtitle[^"]*"[^>]*>.*?<a[^>]*>(.*?)<\/a>.*?<\/h4>/gs;

Job link pattern

const jobLinkRegex = /<h3[^>]*class="[^"]*base-search-card__title[^"]*"[^>]*>.*?<a[^>]*?href="([^"]*)"[^>]*>.*?<\/a>.*?<\/h3>/gs;

Location pattern

const locationRegex = /<span[^>]*class="[^"]*job-search-card__location[^"]*"[^>]*>(.*?)<\/span>/gs;

Posted time pattern

const postedTimeRegex = /<time[^>]*>(.*?)<\/time>/gs;

Salary range pattern (if present)

const salaryRegex = /<span[^>]*class="[^"]*job-search-card__salary-info[^"]*"[^>]*>(.*?)<\/span>/gs;

Job type pattern (full-time, part-time, and so on)

const jobTypeRegex = /<li[^>]*class="[^"]*job-search-card__job-type[^"]*"[^>]*>(.*?)<\/li>/gs;
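
Some of these fields, such as salary, are optional, so extraction code has to cope with a pattern that finds nothing. Below is a minimal sketch using a hypothetical `firstCapture` helper and an invented job card. The markup only mimics the class names used in the patterns above and is not taken from a real LinkedIn page.

```javascript
const jobTitleRegex = /<h3[^>]*class="[^"]*base-search-card__title[^"]*"[^>]*>.*?<a[^>]*>(.*?)<\/a>.*?<\/h3>/gs;
const companyNameRegex = /<h4[^>]*class="[^"]*base-search-card__subtitle[^"]*"[^>]*>.*?<a[^>]*>(.*?)<\/a>.*?<\/h4>/gs;
const locationRegex = /<span[^>]*class="[^"]*job-search-card__location[^"]*"[^>]*>(.*?)<\/span>/gs;
const postedTimeRegex = /<time[^>]*>(.*?)<\/time>/gs;
const salaryRegex = /<span[^>]*class="[^"]*job-search-card__salary-info[^"]*"[^>]*>(.*?)<\/span>/gs;

// Hypothetical helper: first capture group of the first match, or null.
function firstCapture(html, regex) {
  const m = html.matchAll(regex).next().value; // matchAll requires the g flag
  return m ? m[1].trim() : null;
}

// Invented job card for illustration; it has no salary element.
const card = `
<h3 class="base-search-card__title"><a href="https://example.com/jobs/1">Data Engineer</a></h3>
<h4 class="base-search-card__subtitle"><a href="https://example.com/co">Example Corp</a></h4>
<span class="job-search-card__location">Taipei, Taiwan</span>
<time datetime="2025-03-20">4 days ago</time>`;

const job = {
  title: firstCapture(card, jobTitleRegex),
  company: firstCapture(card, companyNameRegex),
  location: firstCapture(card, locationRegex),
  posted: firstCapture(card, postedTimeRegex),
  salary: firstCapture(card, salaryRegex), // null: not present in this card
};
console.log(job);
```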

Using the Patterns on Make.com

Make.com (formerly Integromat) provides HTML to text and Text parser modules, which can be combined with these regular expressions to extract and process web data.

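Steps for extracting Google News search results:

Add an HTTP module
- Set it to a GET request and enter the Google News search URL
Add an HTML to text module
- Connect it to the output of the HTTP module
Add a Text parser module
- Add an item for titles, using titleRegex
- Add an item for links, using linkRegex
- Add an item for descriptions, using descriptionRegex
- Optional: add an item for the alternative description format, using altDescriptionRegex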

Example Text parser module settings:

Extracting titles:
- Pattern type: choose "Regular expression"
- Pattern: <h[3-4][^>]*>.*?<a[^>]*>(.*?)<\/a>.*?<\/h[3-4]>
- Output: {{1}} (the first capturing group, which is the title text)
Extracting links:
- Pattern type: choose "Regular expression"
- Pattern: <h[3-4][^>]*>.*?<a[^>]*?href="([^"]*)"[^>]*>.*?<\/a>.*?<\/h[3-4]>
- Output: {{1}} (the first capturing group, which is the link URL)
Extracting descriptions:
- Pattern type: choose "Regular expression"
- Pattern: <div class="[^"]*?st"[^>]*>(.*?)<\/div>|<span class="[^"]*?st"[^>]*>(.*?)<\/span>
- Output: {{1}}{{2}} (the first or second capturing group, depending on which branch matched)
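
The description pattern has two alternatives, each with its own capturing group, so only one of the two groups is filled for any given match. The sketch below is a JavaScript analogue of that behavior, not Make.com code, run against an invented snippet.

```javascript
const descriptionRegex = /<div class="[^"]*?st"[^>]*>(.*?)<\/div>|<span class="[^"]*?st"[^>]*>(.*?)<\/span>/gs;

// Invented snippet: one description in a <div>, one in a <span>.
const html = '<div class="st">From a div</div><span class="x st">From a span</span>';

// For each match, exactly one of m[1] / m[2] is defined; take whichever it is.
const descriptions = [...html.matchAll(descriptionRegex)].map((m) => m[1] ?? m[2]);
console.log(descriptions); // [ 'From a div', 'From a span' ]
```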

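Practical Tips and Caveats

1. Coping with changes in site structure

Websites update their HTML structure frequently, which can break existing regular expressions. Ways to cope:

Check and update the patterns regularly: set up a recurring monitoring task
Use looser patterns: depend less on specific CSS class names
Implement error handling: have a fallback ready when extraction fails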
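
One way to sketch the fallback idea is to try several patterns in order, from most specific to most permissive, and record which one succeeded. `extractWithFallback` is a hypothetical helper, and the last pattern in the list is an assumed generic fallback that does not appear elsewhere in this article.

```javascript
// Ordered from most specific to most permissive.
const descriptionPatterns = [
  /<div class="[^"]*?VwiC3b[^"]*"[^>]*>(.*?)<\/div>/gs,
  /<div class="[^"]*?st"[^>]*>(.*?)<\/div>/gs,
  /<p[^>]*>(.*?)<\/p>/gs, // assumed generic fallback, for illustration
];

// Hypothetical helper: return the hits of the first pattern that matches anything.
function extractWithFallback(html, patterns) {
  for (const [index, pattern] of patterns.entries()) {
    const hits = [...html.matchAll(pattern)].map((m) => m[1].trim());
    if (hits.length > 0) return { hits, patternIndex: index };
  }
  // Nothing matched: the caller can log this and alert someone to update the patterns.
  return { hits: [], patternIndex: -1 };
}

console.log(extractWithFallback("<p>Only a paragraph</p>", descriptionPatterns));
// patternIndex is 2 here, a hint that the more specific patterns have stopped working
```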

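2. Scraping limits and compliance

Large websites usually restrict scraping. Keep the following in mind:

Follow the site's terms of use: both LinkedIn and Google have terms that restrict automated scraping
Control the request rate: avoid sending too many requests in a short time
Set an appropriate User-Agent: one that resembles a normal browser visit
Consider the official API: where one exists, prefer it over scraping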
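
A minimal sketch of request throttling with an explicit User-Agent header follows. The function name `fetchPolitely`, the two-second default delay, and the header value are all placeholders chosen for the example; the fetch function is passed in so the sketch can be tried without network access.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical helper: fetch URLs one at a time with a pause between requests.
// fetchFn must behave like fetch(url, options) and return an object with text().
async function fetchPolitely(urls, fetchFn, delayMs = 2000) {
  const pages = [];
  for (const [i, url] of urls.entries()) {
    if (i > 0) await sleep(delayMs); // wait before every request except the first
    const response = await fetchFn(url, {
      headers: { "User-Agent": "example-crawler/1.0" }, // placeholder value
    });
    pages.push(await response.text());
  }
  return pages;
}
```

In a runtime that has a global `fetch`, the call would look like `await fetchPolitely(urls, fetch)`. Requests are deliberately sequential here; running them in parallel would defeat the purpose of the delay.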

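3. Optimizing regular expressions

Use lazy matching: .*? rather than .*
Narrow the match: for example, anchor with ^ and $ for an exact match
Avoid excessive complexity: complex patterns are hard to maintain and debug
Break complex patterns apart: use several simple patterns instead of one complicated one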
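
As an illustration of breaking a complex pattern apart, the password rule from example 7 can be expressed as five simple checks. The sketch below accepts the same passwords as `strongPasswordRegex` and can also report which rule failed; the names `passwordChecks` and `failedPasswordChecks` are made up for the example.

```javascript
// Each check is a simple pattern without the g flag, so test() keeps no state.
const passwordChecks = [
  { name: "length and allowed characters", regex: /^[A-Za-z\d@$!%*?&]{8,}$/ },
  { name: "lowercase letter", regex: /[a-z]/ },
  { name: "uppercase letter", regex: /[A-Z]/ },
  { name: "digit", regex: /\d/ },
  { name: "special character", regex: /[@$!%*?&]/ },
];

// Returns the names of the rules the password breaks (empty array means it passes).
function failedPasswordChecks(password) {
  return passwordChecks
    .filter((check) => !check.regex.test(password))
    .map((check) => check.name);
}

console.log(failedPasswordChecks("Password123!")); // []
console.log(failedPasswordChecks("password"));
// [ 'uppercase letter', 'digit', 'special character' ]
```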

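4. Testing and validation

Test and validate every pattern before using it in production:

Use tools such as regex101.com: test patterns visually
Test at small scale: try a small sample first
Verify the extracted results: confirm that the extracted data is correct
Add logging: record both matches and non-matches to help with debugging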
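
Small-sample testing and logging can be combined in a table-driven check. This is a minimal sketch using the username pattern from example 15; the test inputs and the `runCases` helper are invented for the illustration.

```javascript
const usernameRegex = /^[a-zA-Z][a-zA-Z0-9_]{2,15}$/;

// Small sample of inputs with the result we expect for each.
const cases = [
  { input: "john_doe123", expected: true },
  { input: "ab", expected: false },           // too short
  { input: "1user", expected: false },        // starts with a digit
  { input: "a".repeat(17), expected: false }, // too long
];

// Logs every case and returns the inputs whose result differs from the expectation.
// Pass a regex without the g flag: test() on a global regex keeps state between calls.
function runCases(regex, testCases) {
  const failures = [];
  for (const { input, expected } of testCases) {
    const actual = regex.test(input);
    const status = actual === expected ? "PASS" : "FAIL";
    console.log(`${status} ${JSON.stringify(input)} -> ${actual}`);
    if (actual !== expected) failures.push(input);
  }
  return failures;
}

const failures = runCases(usernameRegex, cases);
console.log(`${failures.length} unexpected result(s)`);
```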

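5. Handling special cases

HTML entities: decode entities such as &amp; and &lt;
Multilingual support: account for the characters and encodings of different languages
Dynamically loaded content: some content is loaded by JavaScript after the page arrives and cannot be extracted from the raw HTML with a regular expression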
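
Extracted text often still contains HTML entities. The sketch below decodes numeric entities and a handful of named ones. The entity table is deliberately tiny and the helper name `decodeEntities` is made up, so treat it as a starting point rather than a complete decoder.

```javascript
// A few common named entities; the full HTML list is far longer.
const namedEntities = new Map([
  ["amp", "&"], ["lt", "<"], ["gt", ">"],
  ["quot", '"'], ["apos", "'"], ["nbsp", "\u00a0"],
]);

// Hypothetical helper: decode &name;, &#123; and &#x1F; forms in a single pass.
function decodeEntities(text) {
  return text.replace(/&(#x[0-9a-fA-F]+|#\d+|[a-zA-Z]+);/g, (whole, body) => {
    if (body.startsWith("#")) {
      const isHex = body[1] === "x";
      const code = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
      // Leave out-of-range code points untouched instead of throwing.
      return code <= 0x10ffff ? String.fromCodePoint(code) : whole;
    }
    // Unknown names are left as they are.
    return namedEntities.has(body) ? namedEntities.get(body) : whole;
  });
}

console.log(decodeEntities("Tom &amp; Jerry &lt;3 &#20320;&#x597D; &unknown;"));
// Tom & Jerry <3 你好 &unknown;
```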
