.\" Man page generated from reStructuredText.
.
.
.nr rst2man-indent-level 0
.
.de1 rstReportMargin
\\$1 \\n[an-margin]
level \\n[rst2man-indent-level]
level margin: \\n[rst2man-indent\\n[rst2man-indent-level]]
-
\\n[rst2man-indent0]
\\n[rst2man-indent1]
\\n[rst2man-indent2]
..
.de1 INDENT
.\" .rstReportMargin pre:
. RS \\$1
. nr rst2man-indent\\n[rst2man-indent-level] \\n[an-margin]
. nr rst2man-indent-level +1
.\" .rstReportMargin post:
..
.de UNINDENT
. RE
.\" indent \\n[an-margin]
.\" old: \\n[rst2man-indent\\n[rst2man-indent-level]]
.nr rst2man-indent-level -1
.\" new: \\n[rst2man-indent\\n[rst2man-indent-level]]
.in \\n[rst2man-indent\\n[rst2man-indent-level]]u
..
.TH "URLWATCH-FILTERS" "5" "Oct 28, 2024" "urlwatch " "urlwatch Documentation"
.SH NAME
urlwatch-filters \- Filtering output and diff data of urlwatch jobs
.SH SYNOPSIS
.sp
urlwatch \-\-edit
.SH DESCRIPTION
.sp
Each job can have two filter stages configured, each running one or more filters in sequence:
.INDENT 0.0
.IP \(bu 2
Applied to the downloaded page before diffing the changes (\fBfilter\fP)
.IP \(bu 2
Applied to the diff result before reporting the changes (\fBdiff_filter\fP)
.UNINDENT
.sp
While creating your filter pipeline, you might want to preview what the filtered output looks like. To do so, first configure your job, then run urlwatch with the \fB\-\-test\-filter\fP command, passing in the index (from \fB\-\-list\fP) or the URL/location of the job to be tested:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
urlwatch \-\-test\-filter 1   # Test the first job in the list
urlwatch \-\-test\-filter https://example.net/  # Test the job with the given URL
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
The output of this command is the filtered plaintext of the job. In a real urlwatch run, this output is the input to the diff algorithm.
.sp
The \fBfilter\fP is only applied to new content; the old content was already filtered when it was retrieved.
This means that changes to \fBfilter\fP are not visible when reporting unchanged contents (see \fI\%Display\fP for details), and the diff compares the old content (filtered with the filter in effect when it was retrieved) against the new content (filtered with the current filter).
.sp
Once urlwatch has collected at least two historic snapshots of a job (two different states of a webpage), you can use the command\-line option \fB\-\-test\-diff\-filter\fP to test your \fBdiff_filter\fP settings; this will use historic data cached locally.
.SH BUILT-IN FILTERS
.sp
The list of built\-in filters can be retrieved using:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
urlwatch \-\-features
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
At the moment, the following filters are built\-in:
.INDENT 0.0
.IP \(bu 2
\fBbeautify\fP: Beautify HTML
.IP \(bu 2
\fBcss\fP: Filter XML/HTML using CSS selectors
.IP \(bu 2
\fBcsv2text\fP: Convert CSV to plaintext
.IP \(bu 2
\fBelement\-by\-class\fP: Get all HTML elements by class
.IP \(bu 2
\fBelement\-by\-id\fP: Get an HTML element by its ID
.IP \(bu 2
\fBelement\-by\-style\fP: Get all HTML elements by style
.IP \(bu 2
\fBelement\-by\-tag\fP: Get an HTML element by its tag
.IP \(bu 2
\fBformat\-json\fP: Convert to formatted JSON
.IP \(bu 2
\fBgrep\fP: Filter only lines matching a regular expression
.IP \(bu 2
\fBgrepi\fP: Remove lines matching a regular expression
.IP \(bu 2
\fBhexdump\fP: Convert binary data to hex dump format
.IP \(bu 2
\fBhtml2text\fP: Convert HTML to plaintext
.IP \(bu 2
\fBpdf2text\fP: Convert PDF to plaintext
.IP \(bu 2
\fBpretty\-xml\fP: Pretty\-print XML
.IP \(bu 2
\fBical2text\fP: Convert \fI\%iCalendar\fP <\fBhttps://en.wikipedia.org/wiki/ICalendar\fP> to plaintext
.IP \(bu 2
\fBocr\fP: Convert text in images to plaintext using Tesseract OCR
.IP \(bu 2
\fBre.sub\fP: Replace text with regular expressions using Python\(aqs re.sub
.IP \(bu 2
\fBre.findall\fP: Find all non\-overlapping matches using Python\(aqs re.findall
.IP \(bu 2
\fBreverse\fP: Reverse
input items
.IP \(bu 2
\fBsha1sum\fP: Calculate the SHA\-1 checksum of the content
.IP \(bu 2
\fBshellpipe\fP: Filter using a shell command
.IP \(bu 2
\fBsort\fP: Sort input items
.IP \(bu 2
\fBremove\-duplicate\-lines\fP: Remove duplicate lines (case sensitive)
.IP \(bu 2
\fBstrip\fP: Strip leading and trailing whitespace
.IP \(bu 2
\fBstriplines\fP: Strip leading and trailing whitespace in each line
.IP \(bu 2
\fBxpath\fP: Filter XML/HTML using XPath expressions
.IP \(bu 2
\fBjq\fP: Filter, transform and extract values from JSON
.UNINDENT
.SH PICKING OUT ELEMENTS FROM A WEBPAGE
.sp
You can pick out a single HTML element with a built\-in filter. For example, to extract the element with \fBid=\(dqsomething\(dq\fP from a page, use the following in your \fBurls.yaml\fP:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
url: http://example.org/idtest.html
filter:
  \- element\-by\-id: something
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
You can also chain filters, for example to run html2text on the result:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
url: http://example.net/id2text.html
filter:
  \- element\-by\-id: something
  \- html2text
.ft P
.fi
.UNINDENT
.UNINDENT
.SH CHAINING MULTIPLE FILTERS
.sp
The example urls.yaml file also demonstrates the use of built\-in filters. Here three filters are chained (html2text, line\-grep and whitespace removal) to extract a single info field from a webpage:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
url: https://example.net/version.html
filter:
  \- html2text
  \- grep: \(dqCurrent.*version\(dq
  \- strip
.ft P
.fi
.UNINDENT
.UNINDENT
.SH EXTRACTING ONLY THE <BODY> TAG OF A PAGE
.sp
If you want to extract only the body tag, you can use this filter:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
url: https://example.org/bodytag.html
filter:
  \- element\-by\-tag: body
.ft P
.fi
.UNINDENT
.UNINDENT
.SH FILTERING BASED ON AN XPATH EXPRESSION
.sp
To filter based on an \fI\%XPath\fP <\fBhttps://www.w3.org/TR/1999/REC-xpath-19991116/\fP> expression, you can use the \fBxpath\fP filter like so:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
url: https://example.net/xpath.html
filter:
  \- xpath: /html/body/marquee
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
This keeps only the \fB<marquee>\fP elements directly below the \fB<body>\fP element, which in turn must be below the \fB<html>\fP element of the document, stripping out everything else.
.sp
See Microsoft’s \fI\%XPath Examples\fP <\fBhttps://msdn.microsoft.com/en-us/library/ms256086(v=vs.110).aspx\fP> page for some other examples. You can also find the XPath of an HTML node in the Chromium/Google Chrome developer tools by right\-clicking on the node and selecting \fBcopy XPath\fP\&.
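.sp
As a sketch only: assuming the \fBxpath\fP filter can be chained like the other built\-in filters described in this page, the following hypothetical job selects the same elements and then converts them to plaintext with \fBhtml2text\fP. The URL is a placeholder, and the combination is an illustration rather than a recommended setup:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
# Hypothetical job: chain xpath with html2text (assumed to combine like other filters)
url: https://example.net/xpath.html
filter:
  \- xpath: /html/body/marquee
  \- html2text
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
Running \fBurlwatch \-\-test\-filter\fP with the job\(aqs index or URL previews the result of such a chain before relying on it.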
.SH FILTERING BASED ON CSS SELECTORS
.sp
To filter based on a \fI\%CSS selector\fP <\fBhttps://www.w3.org/TR/2011/REC-css3-selectors-20110929/\fP>, you can use the \fBcss\fP filter like so:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
url: https://example.net/css.html
filter:
  \- css: ul#groceries > li.unchecked
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
This would keep only \fB<li class=\(dqunchecked\(dq>\fP tags directly below \fB