'\" t
.\" Title: shapeclustering
.\" Author: [see the "AUTHOR" section]
.\" Generator: DocBook XSL Stylesheets vsnapshot
.\" Date: 11/11/2024
.\" Manual: \ \&
.\" Source: \ \&
.\" Language: English
.\"
.TH "SHAPECLUSTERING" "1" "11/11/2024" "\ \&" "\ \&"
.\" -----------------------------------------------------------------
.\" * Define some portability stuff
.\" -----------------------------------------------------------------
.\" ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.\" http://bugs.debian.org/507673
.\" http://lists.gnu.org/archive/html/groff/2009-02/msg00013.html
.\" ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.ie \n(.g .ds Aq \(aq
.el .ds Aq '
.\" -----------------------------------------------------------------
.\" * set default formatting
.\" -----------------------------------------------------------------
.\" disable hyphenation
.nh
.\" disable justification (adjust text to left margin only)
.ad l
.\" -----------------------------------------------------------------
.\" * MAIN CONTENT STARTS HERE *
.\" -----------------------------------------------------------------
.SH "NAME"
shapeclustering \- shape clustering training for Tesseract
.SH "SYNOPSIS"
.sp
shapeclustering \-D \fIoutput_dir\fR \-U \fIunicharset\fR \-O \fImfunicharset\fR \-F \fIfont_props\fR \-X \fIxheights\fR \fIFILE\fR\&...
.SH "DESCRIPTION"
.sp
shapeclustering(1) takes extracted feature \&.tr files (generated by tesseract(1) run in a special mode from box files) and produces a file \fBshapetable\fR and an enhanced unicharset\&. This program is still experimental, and is not required (yet) for training Tesseract\&.
.SH "OPTIONS"
.PP
\-U \fIFILE\fR
.RS 4
The unicharset generated by unicharset_extractor(1)\&.
.RE
.PP
\-D \fIdir\fR
.RS 4
Directory to write output files to\&.
.RE
.PP
\-F \fIfont_properties_file\fR
.RS 4
(Input) font properties file\&. Each line has the following form, where every field other than the font name is 0 or 1:
.sp
.if n \{\
.RS 4
.\}
.nf
\*(Aqfont_name\*(Aq \*(Aqitalic\*(Aq \*(Aqbold\*(Aq \*(Aqfixed_pitch\*(Aq \*(Aqserif\*(Aq \*(Aqfraktur\*(Aq
.fi
.if n \{\
.RE
.\}
.RE
.PP
\-X \fIxheights_file\fR
.RS 4
(Input) x heights file\&. Each line has the following form, where xheight is calculated as the pixel x height of a character drawn at 32pt at 300 dpi\&. [ That is, if base x height + ascenders + descenders = 133, how much is x height? ]
.sp
.if n \{\
.RS 4
.\}
.nf
\*(Aqfont_name\*(Aq \*(Aqxheight\*(Aq
.fi
.if n \{\
.RE
.\}
.RE
.PP
\-O \fIFILE\fR
.RS 4
The output unicharset that will be given to combine_tessdata(1)\&.
.RE
.SH "SEE ALSO"
.sp
tesseract(1), cntraining(1), unicharset_extractor(1), combine_tessdata(1), unicharset(5)
.sp
\m[blue]\fBhttps://tesseract\-ocr\&.github\&.io/tessdoc/Training\-Tesseract\&.html\fR\m[]
.SH "COPYING"
.sp
Copyright (C) Google, 2011\&. Licensed under the Apache License, Version 2\&.0\&.
.SH "AUTHOR"
.sp
The Tesseract OCR engine was written by Ray Smith and his research groups at Hewlett Packard (1985\-1995) and Google (2006\-2018)\&.