URLWATCH-FILTERS(5)                urlwatch                URLWATCH-FILTERS(5)

NAME
       urlwatch-filters - Filtering output and diff data of urlwatch jobs

SYNOPSIS
       urlwatch --edit

DESCRIPTION
       Each job can have two filter stages configured, with one or more
       filters processed after each other:

       o Applied to the downloaded page before diffing the changes (filter)

       o Applied to the diff result before reporting the changes
         (diff_filter)

       While creating your filter pipeline, you might want to preview what
       the filtered output looks like. You can do so by first configuring
       your job and then running urlwatch with the --test-filter command,
       passing in the index (from --list) or the URL/location of the job to
       be tested:

          urlwatch --test-filter 1   # Test the first job in the list
          urlwatch --test-filter https://example.net/   # Test the job with the given URL

       The output of this command is the filtered plaintext of the job. In a
       real urlwatch run, this output is the input to the diff algorithm.

       The filter is only applied to new content; the old content was
       already filtered when it was retrieved. This means that changes to
       filter are not visible when reporting unchanged contents (see Display
       for details), and the diff output will be between (old content with
       filter at the time old content was retrieved) and (new content with
       current filter).

       Once urlwatch has collected at least 2 historic snapshots of a job
       (two different states of a webpage), you can use the command-line
       option --test-diff-filter to test your diff_filter settings; this
       will use historic data cached locally.

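       As a minimal sketch, the two stages described above might be combined
       in a single job as follows. The URL is a placeholder, and the grep
       pattern used for diff_filter is an assumption chosen for illustration
       only: it would keep just the lines of the diff that begin with + or -,
       which also includes the diff's own header lines.

          # Hypothetical job: placeholder URL, illustrative diff_filter pattern
          url: https://example.net/version.html
          filter:            # applied to the downloaded page before diffing
            - html2text
            - strip
          diff_filter:       # applied to the diff result before reporting
            - grep: "^[+-]"

       The filter stage of such a job can be previewed with --test-filter;
       the diff_filter stage can be checked with --test-diff-filter once two
       snapshots have been collected.
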
BUILT-IN FILTERS
       The list of built-in filters can be retrieved using:

          urlwatch --features

       At the moment, the following filters are built-in:

       o beautify: Beautify HTML

       o css: Filter XML/HTML using CSS selectors

       o csv2text: Convert CSV to plaintext

       o element-by-class: Get all HTML elements by class

       o element-by-id: Get an HTML element by its ID

       o element-by-style: Get all HTML elements by style

       o element-by-tag: Get an HTML element by its tag

       o format-json: Convert to formatted json

       o grep: Filter only lines matching a regular expression

       o grepi: Remove lines matching a regular expression

       o hexdump: Convert binary data to hex dump format

       o html2text: Convert HTML to plaintext

       o pdf2text: Convert PDF to plaintext

       o pretty-xml: Pretty-print XML

       o ical2text: Convert iCalendar to plaintext

       o ocr: Convert text in images to plaintext using Tesseract OCR

       o re.sub: Replace text with regular expressions using Python's re.sub

       o reverse: Reverse input items

       o sha1sum: Calculate the SHA-1 checksum of the content

       o shellpipe: Filter using a shell command

       o sort: Sort input items

       o remove-duplicate-lines: Remove duplicate lines (case sensitive)

       o strip: Strip leading and trailing whitespace

       o striplines: Strip leading and trailing whitespace in each line

       o xpath: Filter XML/HTML using XPath expressions

       o jq: Filter, transform and extract values from JSON

PICKING OUT ELEMENTS FROM A WEBPAGE
       You can pick only a given HTML element with the built-in filter, for
       example to extract
       <div id="something">.../</div>
       from a page, you can use the following in your urls.yaml:

          url: http://example.org/idtest.html
          filter:
            - element-by-id: something

       Also, you can chain filters, so you can run html2text on the result:

          url: http://example.net/id2text.html
          filter:
            - element-by-id: something
            - html2text

CHAINING MULTIPLE FILTERS
       The example urls.yaml file also demonstrates the use of built-in
       filters. Here, three filters are used: html2text, line-grep (grep)
       and whitespace removal (strip), to get just a certain info field from
       a webpage:

          url: https://example.net/version.html
          filter:
            - html2text
            - grep: "Current.*version"
            - strip

EXTRACTING ONLY THE <BODY> TAG OF A PAGE
       If you want to extract only the body tag, you can use this filter:

          url: https://example.org/bodytag.html
          filter:
            - element-by-tag: body

FILTERING BASED ON AN XPATH EXPRESSION
       To filter based on an XPath expression, you can use the xpath filter
       like so:

          url: https://example.net/xpath.html
          filter:
            - xpath: /html/body/marquee

       This filters only the <marquee> elements directly below the <body>
       element, which in turn must be below the <html> element of the
       document, stripping out everything else.

       See Microsoft's XPath Examples page for some other examples. You can
       also find the XPath of a node in the Chromium/Google Chrome developer
       tools by right-clicking on the node and selecting "Copy XPath".

FILTERING BASED ON CSS SELECTORS
       To filter based on a CSS selector, you can use the css filter like
       so:

          url: https://example.net/css.html
          filter:
            - css: ul#groceries > li.unchecked

       This would filter only
       <li class="unchecked"> tags directly below