Gentoo Forums
Redirecting stdin to parallel scripts
tkzv
Tux's lil' helper


Joined: 22 Aug 2014
Posts: 92

Posted: Sun Feb 09, 2025 2:59 am    Post subject: Redirecting stdin to parallel scripts

I use a simple script to recognize text from screenshots with a hotkey:
Code:
xclip -out -selection clipboard -target image/jpeg | { tesseract - - -l eng+fra+esp; echo ; } | xclip -selection clipboard -in


Sometimes the screenshot doesn't have enough contrast, or the background varies too much, or there is some other problem. I found that the quickest solution is to OCR the image at several different gamma values and pick the best result (or the best parts) in a text editor. I wrote the following script, which does not need any temporary files:

Code:
function ocr () ( tesseract - - -l eng+fra+esp; echo $1 ; echo ; )

function vary_gamma () ( echo $1; convert - -gamma $1 -format jpeg - | ocr $1 )

xclip -out -selection clipboard -target image/jpeg | tee \
        >( vary_gamma 0.0625 ) \
        >( vary_gamma 8 ) \
        >( vary_gamma 0.125 ) \
        >( vary_gamma 4 ) \
        >( vary_gamma 0.25 ) \
        >( vary_gamma 2 ) \
        >( vary_gamma 0.5 ) \
        >( echo 1; ocr 1 ) \
        > /dev/null \
    | xclip -selection clipboard -in


There are several problems with it:
    1. The output from different subprocesses is mixed. This doesn't seem to be a problem with the recognized text, as Tesseract seems to output it as a single chunk, but the first `echo $1` is printed separately. How do I emit all output from a vary_gamma instance as a single unbroken piece of text?
    2. All processes try to run simultaneously. I don't quite understand how they are scheduled, but on an 8-core processor the number of parallel tesseract processes never exceeded 8, and each OCR task created 2 processes. The results are output in the opposite order, in this case 1 - 0.5 - 2 - 0.25 - 4 - 0.125 - 8 - 0.0625. Sometimes I need to maximize the number of jobs, and sometimes I need to run them sequentially. Is there a simple way to schedule them the way GNU Parallel does?
    3. Is it possible to write this more compactly, for example with a for loop instead of listing each gamma on a separate line? Something like:
    Code:

    ... | tee >( ocr 1 ) >( for gamma in 0.5 2 0.25 4 0.125 8 0.0625; do tee >( vary_gamma $gamma ) ; done ) | ...

    (I know this doesn't work.)
Goverp
Advocate


Joined: 07 Mar 2007
Posts: 2204

Posted: Sun Feb 09, 2025 10:15 am    Post subject:

You possibly already know this, but tesseract is limited to 4 threads. It's a design feature that I suspect is left over from the days when most processors had fewer than 4 cores...
_________________
Greybeard
tkzv
Tux's lil' helper


Joined: 22 Aug 2014
Posts: 92

Posted: Sun Feb 09, 2025 10:46 am    Post subject:

Goverp wrote:
You possibly already know this, but tesseract is limited to 4 threads. It's a design feature that I suspect is left over from the days when most processors had fewer than 4 cores...

No, never heard of it. What about running tesseract in several separate shells? Does the limitation still apply?
Hu
Administrator


Joined: 06 Mar 2007
Posts: 23103

Posted: Sun Feb 09, 2025 3:32 pm    Post subject: Re: Redirecting stdin to parallel scripts

tkzv wrote:
    1. The output from different subprocesses is mixed. This doesn't seem to be a problem with the recognized text, as Tesseract seems to output it as a single chunk, but the first `echo $1` is printed separately. How do I emit all output from a vary_gamma instance as a single unbroken piece of text?
Any time two or more calls to write are used to print the output, this can happen. Either arrange for Tesseract to write the value of $1 as part of its output, or arrange for all output from vary_gamma to be written to a temporary buffer, and then printed as a unit once the buffer is complete. A temporary file is the easiest, but not the only, solution here. If you don't want to deal with cleanup, some trickery with shell redirection can arrange for bash to make the temporary files for you.

Modern GNU make has a feature to handle this output buffering for you. See --output-sync in info make.
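
As a minimal sketch of the buffer-then-print idea above (not the original poster's script): `fake_ocr` is a hypothetical stand-in for the real `convert ... | tesseract ...` pipeline, and the whole result is collected in a shell variable before being emitted with one `printf`. This assumes the recognized text is small. POSIX only guarantees that writes of up to PIPE_BUF bytes to a pipe are not interleaved, so very large outputs could still get mixed.

```shell
#!/bin/bash
# Hypothetical stand-in for the real `convert ... | tesseract ...` pipeline:
# it consumes all of stdin and prints a fixed line.
fake_ocr() {
    cat >/dev/null
    printf 'recognized text\n'
}

vary_gamma() {
    local out
    # Collect the label and the OCR result in a variable first ...
    out=$(printf 'gamma %s\n' "$1"; fake_ocr)
    # ... then emit everything with a single printf call, followed by a blank line.
    printf '%s\n\n' "$out"
}
```

It would be used the same way as before, for example `... | tee >( vary_gamma 0.5 ) >( vary_gamma 2 ) > /dev/null | ...`. Note that command substitution strips trailing newlines, which is why the blank separator line is added back in the final `printf`.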
tkzv wrote:
    2. All processes try to run simultaneously. I don't quite understand how they are scheduled, but on an 8-core processor the number of parallel tesseract processes never exceeded 8, and each OCR task created 2 processes. The results are output in the opposite order, in this case 1 - 0.5 - 2 - 0.25 - 4 - 0.125 - 8 - 0.0625. Sometimes I need to maximize the number of jobs, and sometimes I need to run them sequentially. Is there a simple way to schedule them the way GNU Parallel does?
As written, you create one process for each of the process substitutions (>( ... )). If you don't want to run them all at once, then you need to avoid creating them all at the beginning.
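
One possible way to control how many jobs start at once is sketched below. It assumes that a temporary file is acceptable after all (the original script avoids them) and that `xargs` supports the non-POSIX `-P` option, as the GNU and BSD versions do. `process_one` and `run_all` are hypothetical names, and `process_one` only stands in for the `convert | tesseract` step.

```shell
#!/bin/bash
# Hypothetical stand-in for `convert "$file" -gamma "$gamma" ... | tesseract ...`.
# The result is buffered so that each job prints its output in one piece.
process_one() {
    local gamma=$1 file=$2 out
    out=$(printf 'gamma %s\n' "$gamma"; tr 'a-z' 'A-Z' <"$file")
    printf '%s\n' "$out"
}
export -f process_one   # make the function visible to the bash started by xargs

# run_all NJOBS GAMMA...: save stdin once, then process it NJOBS gammas at a time.
# The body is a subshell, so the EXIT trap removes the temporary file when it ends.
run_all() (
    njobs=$1; shift
    tmp=$(mktemp) || exit 1
    trap 'rm -f "$tmp"' EXIT
    cat >"$tmp"
    printf '%s\n' "$@" |
        xargs -P "$njobs" -I{} bash -c 'process_one "$1" "$2"' _ {} "$tmp"
)
```

With `run_all 1 0.5 2 0.25 4` the gammas are processed one after another in the given order; with a larger first argument (for instance the output of `nproc` from GNU coreutils) up to that many jobs run at once, and the output order is no longer guaranteed.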
tkzv wrote:
    3. Is it possible to write this more compactly, for example with a for loop instead of listing each gamma on a separate line?
Yes, but I think you are getting to the point where you should use a more powerful language. Bash can be made to do what you want, but the script will quickly become hard to maintain.
Code:
#!/bin/bash

set -euo pipefail

function vary_gamma() {
   printf 'vary_gamma: begin %s\n' "$1"
   # Block until all input has been provided.  In a real script, this
   # would be the call to `tesseract`.
   cat >/dev/null
   # Delay so that the user can see that `wait` blocks waiting for
   # `vary_gamma` to finish.
   sleep 1
   printf 'vary_gamma: end %s\n' "$1"
}

declare -a subprocesses
for i; do
   # Start a vary_gamma subprocess.  Set aside its input descriptor, and do
   # not give it any input yet.
   exec {fd}> >( vary_gamma "$i" )
   subprocesses+=( "$fd" )
done
# Debug print the created descriptors
printf 'subprocesses: \n'
printf '\t%q\n' "${subprocesses[@]/#//proc/self/fd/}"
# Feed some dummy input to all the child processes.
tee <<<X "${subprocesses[@]/#//proc/self/fd/}"
# Close the descriptors, so that the child process receives an end-of-file.
for fd in "${subprocesses[@]}"; do
   exec {fd}>&-
done
# Wait for the child processes to notice the end-of-file and exit.  Without
# this, the processing delay (simulated via `sleep`) causes the `end` message
# to appear after the shell prompt returns.
wait
Goverp
Advocate


Joined: 07 Mar 2007
Posts: 2204

Posted: Mon Feb 10, 2025 8:56 am    Post subject:

tkzv wrote:
Goverp wrote:
You possibly already know this, but tesseract is limited to 4 threads. It's a design feature that I suspect is left over from the days when most processors had fewer than 4 cores...

No, never heard of it. What about running tesseract in several separate shells? Does the limitation still apply?

From "man tesseract":
Quote:
ENVIRONMENT VARIABLES

TESSDATA_PREFIX

If the TESSDATA_PREFIX is set to a path, then that path is used to find the tessdata directory with language and script recognition models and config files. Using --tessdata-dir PATH is the recommended alternative.

OMP_THREAD_LIMIT

If the tesseract executable was built with multithreading support, it will normally use four CPU cores for the OCR process. While this can be faster for a single image, it gives bad performance if the host computer provides less than four CPU cores or if OCR is made for many images. Only a single CPU core is used with OMP_THREAD_LIMIT=1.

I presume running "n" copies of tesseract in separate processes will let you run 4n threads.
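
To try the single-core setting from the man page excerpt for one command only, a small wrapper may help. This is just a sketch: `one_thread` is a hypothetical helper name, it does nothing beyond setting the environment variable, and per the excerpt the variable only matters if tesseract was built with multithreading support.

```shell
#!/bin/bash
# Hypothetical helper: run any command with OMP_THREAD_LIMIT=1 set
# in that command's environment only.
one_thread() {
    OMP_THREAD_LIMIT=1 "$@"
}

# Usage with the command from the first post would look like:
#   xclip -out -selection clipboard -target image/jpeg \
#       | one_thread tesseract - - -l eng+fra+esp
```

Running several such single-threaded instances in parallel is then limited by the number of processes rather than by threads per process.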
_________________
Greybeard