
/tech/ - Technology

"Technology reveals the active relation of man to nature" - Karl Marx

I'm a volunteer for Marxists.org. Finding forgotten gold articles from 100 years ago and sharing them with the modern world is my jam. The problem is that they're often microfilm scans that are a pain in the ass to read, so I have to transcribe them - which rarely goes smoothly with my OCR software. A lot of the time I have to resort to typing everything out by sight, which as you can imagine takes forever.

That OCR software is ABBYY FineReader 15, said to be the best when I pirated it right before the big machine learning breakthroughs. Is "AI" able to work magic for optical character recognition now?

Attached is a book-length article I'd like to transcribe. It's mostly too fuzzy for FineReader 15 to handle. I was originally going to call on /leftypol/ to help me transcribe it by hand, but I thought I'd ask /tech/ first to see if a machine can do it after all.

TL;DR: help me transcribe this plz.

I think the most popular one now is tesseract-ocr. Have you tried it?

>>31557
I'll admit I haven't tried it. A quick Google search seems to imply that it's a command line utility? I'm willing to learn how to use that if it's worth it (I do a lot of this kind of stuff), but I have a pretty baseline understanding of tech for an imageboard user (not even on linux smh)

>>31558
I found tesseract easy to use. idk how it is on linux, but I just installed the main package and then the language packages.
I assume it wouldn't be much harder on windows.
Please report your results back. To see how much OCR software has improved since (apparently) 2021, you could post a comparison of a hand-typed transcription, an ABBYY scan and a tesseract one.
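If the command line ends up being a pain, there's also a Python wrapper. A minimal sketch, assuming you have Python with the pytesseract and Pillow packages installed and tesseract itself on your PATH:

# OCR a single page image to plain text with the pytesseract wrapper.
from PIL import Image
import pytesseract

text = pytesseract.image_to_string(Image.open("page.png"), lang="eng")
print(text)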

>>31559
I have the command line utility installed and I'm already in over my head. I can't seem to input the file location of the file I want to convert.

U on windows?
I found this on Google
https://github.com/cloudy-sfu/GUI-for-tesseract-OCR

Haven't tested it though, I wiped my windows SSD to install more games

>>31560
An update: this is what happens when I input this command:

tesseract C:\Users\Anon\etc.\testFile.pdf -l eng output

It gives me this error:

Error, could not create TXT output file: Permission denied

Does anyone know what's going on?

>>31568
I feel stupid, that error was easily resolved by just running the command prompt as Administrator in the right-click menu. But now it seems to be telling me that PDFs aren't supported?

Error in pixReadStream: Pdf reading is not supported

Leptonica Error in pixRead: pix not read: [file location]

That seems like a pretty massive oversight if you ask me, because pretty much every single source or scan output is in PDF form.

>>31569
OK, I've been fucking around with it a bit more by screenshotting the PDF and then working with those PNGs. I've finally been able to get the software to work, and the result is… disappointing. Half the time it will just flat out refuse to read the text and spit out errors like "invalid box", and when it does work, I've found that it's worse than ABBYY 15 anyway.

It's entirely possible that I'm just not doing it right because I have almost no experience with command line programs. Am I? Are there any alternatives to Tesseract?

File: 1760758865026.zip (2.02 KB, OCR.zip)

>>31554
Looked into this briefly at one point and came to the conclusion that tesseract was the best tool, but too complicated for me to use at the time. LLMs are not currently trained for processing PDFs, so you need to extract the pages into individual images which can be processed. This results in some growth in the file sizes; at five million tokens per USD you're looking at a dollar every ten pages or so, unless you host the model locally or wait for scaling to kick in… Attached is a script that does this using Groq's API; you can see the results for the first page here: https://0x0.st/KQ-U.txt
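
For reference, the page-to-image step on its own is only a few lines. A minimal sketch, assuming pymupdf is installed (the 200 dpi value is just a guess you'd tune):

# Render every PDF page to a PNG byte string with PyMuPDF.
import pymupdf

doc = pymupdf.open("input.pdf")
pages = [page.get_pixmap(dpi=200).tobytes("png") for page in doc]
# Each entry in pages can then be handed to whatever OCR backend you like.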

>>31571
>host the model locally
You'd need a monster setup to host this locally.

>>31571
>briefly at one point […] tesseract
This is fortunately outdated information.
Looks like EasyOCR [^1] is the easiest high-quality OCR available today.
The interface is roughly the same as in the previous script:
- Split the PDF into images.
- Pass these to the OCR function.
Unfortunately this seems to be extremely slow on my hardware (no dedicated GPU), and furthermore I wasn't able to get it to output anything.
If you'd like me to try again just tell me, and I'll give it another shot!
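For anyone who wants to try it themselves, the core is only a few lines. A minimal sketch, assuming the easyocr package installed cleanly:

# Minimal EasyOCR sketch: OCR one page image and join the detected lines.
import easyocr

reader = easyocr.Reader(["en"], gpu=False)  # first run downloads the models
lines = reader.readtext("page.png", detail=0)  # detail=0 returns just the strings
print("\n".join(lines))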

:[^1] https://github.com/JaidedAI/EasyOCR

>>31574
Looks like there's a convenient wrapper for tesseract/EasyOCR called OCRmyPDF [^1].

ocrmypdf -l eng+deu --output-type=none --redo-ocr --sidecar output.txt input.pdf - >/dev/null

If you install the OCRmyPDF EasyOCR plugin for higher quality output [^2]:
ocrmypdf -l eng+deu --pdf-renderer sandwich --output-type=none --redo-ocr --sidecar output.txt input.pdf - >/dev/null

On windows I think you'd have to swap out the >/dev/null for > NUL, but maybe nothing else.
There are a couple of dependencies to get this to work [^3].
Output for the full document using the latter setup follows: https://0x0.st/K1QW.txt
Unfortunately even with this most advanced affordable technology the output is still rubbish.
It took maybe fifteen or twenty minutes to run on my system without a dedicated GPU for these ten pages.
It might be for tough cases like this that it would actually be worth spending the dollar for the LLM output…
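If you end up batch-processing a folder of these scans, wrapping the same command in a loop is simple enough. A minimal sketch, assuming ocrmypdf is on your PATH (scans/ and txt/ are made-up folder names):

# Run ocrmypdf over every PDF in a folder, collecting the sidecar text files.
import pathlib
import subprocess

src = pathlib.Path("scans")
dst = pathlib.Path("txt")
dst.mkdir(exist_ok=True)

for pdf in sorted(src.glob("*.pdf")):
    sidecar = dst / pdf.with_suffix(".txt").name
    subprocess.run(
        ["ocrmypdf", "-l", "eng+deu", "--output-type=none", "--redo-ocr",
         "--sidecar", str(sidecar), str(pdf), "-"],
        stdout=subprocess.DEVNULL, check=True,
    )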

:[^1] https://ocrmypdf.readthedocs.io/en/latest/cookbook.html#basic-examples
:[^2] https://github.com/ocrmypdf/OCRmyPDF-EasyOCR
:[^3] https://ocrmypdf.readthedocs.io/en/latest/installation.html#installing-on-windows

>>31571
>five million tokens per USD you're looking at a dollar every ten pages or so.
Was able to simply extract the JPEGs from the PDF, and to upload these instead of using base64.
This is because we were accidentally up-scaling, and because base64 encodes every 3 bytes as 4 characters it adds 33%.
This resulted in a savings of around 75%.
Tried using the cheaper scout model, but this wasn't able to OCR all the files.
Also my original computation for the cost was off.
Empirically the cost to run this scan was 0.018803 USD.
The interface really should be modified to allow for inspection and optional re-encoding of the images.

Anyway here's the full scan: https://0x0.st/KjlU.txt

import os
import pymupdf
import pathlib
import requests
import argparse
from groq import Groq
from typing import List

def render_pages(
    src_pdf: pathlib.Path,
) -> List[bytes]:
    # Extract the embedded page scans (JPEGs) directly instead of
    # re-rasterizing, so nothing gets up-scaled or re-encoded.
    doc = pymupdf.open(src_pdf)
    images = []
    for page in doc:
        for img in page.get_images():
            xref = img[0]
            pix = doc.extract_image(xref)
            images.append(pix['image'])
    return images

# def render_pages(
#     src_pdf: pathlib.Path,
#     resolution: int = 200,
# ) -> List[bytes]:
#     pages = []
#     with Image(
#         filename=str(src_pdf) + "[0]",
#         resolution=resolution,
#         depth=8,
#         colorspace="gray"
#     ) as images:
#         for image in images.sequence:
#             compressed = Image(image=image)
#             compressed.colorspace = "gray"
#             compressed.depth = 8
#             # compressed.format = "jpeg"
#             # compressed.compression = "jpeg"
#             # compressed.strip()
#             # buffer = io.BytesIO()
#             compressed.save(filename="temp.jpg")
#             # pages.append(buffer.getvalue())
#     return pages

def upload_image(
    image: bytes,
) -> str:
    # Upload one page to the 0x0.st file host and return its URL,
    # so the model fetches the image rather than receiving base64.
    return requests.post('https://0x0.st',
        files={'file': ('image.jpeg', image, 'image/jpeg')},
        headers={'User-Agent': 'curl/7'}).text.strip()

def OCR_image(image_url: str) -> str:
    # One OCR request per page image; the model reads the image from the URL.
    client = Groq(api_key=os.environ.get("GROQ_API_KEY"))

    chat_completion = client.chat.completions.create(
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "Return the plain text content of this JPEG image verbatim."
                    },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": image_url,
                        },
                    },
                ],
            }
        ],
        model="meta-llama/llama-4-maverick-17b-128e-instruct",
    )

    completion = chat_completion.choices[0].message.content
    return completion if completion else ""

def main():
    ap = argparse.ArgumentParser(description="Split PDF and render pages to OCR-friendly images")
    ap.add_argument("pdf", help="input PDF file")
    ap.add_argument("--out-dir", "-o", type=pathlib.Path, help="output folder (default: ./pdfname_imgs)")
    args = ap.parse_args()

    src_pdf = pathlib.Path(args.pdf).expanduser().resolve()
    # Fall back to the default promised in the help text when -o is omitted.
    out_dir = args.out_dir or src_pdf.parent / (src_pdf.stem + "_imgs")
    out_dir.mkdir(exist_ok=True)

    images = render_pages(src_pdf)
    # Append each page's OCR output to a single <pdfname>.txt transcript.
    with (out_dir / (src_pdf.stem + ".txt")).open("a") as file:
        for image in images:
            url = upload_image(image)
            file.write(OCR_image(url))

if __name__ == "__main__":
    main()
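
To run it, assuming you save it as ocr.py: install the dependencies (pip install pymupdf requests groq), set GROQ_API_KEY in your environment, then python ocr.py input.pdf -o out. The transcript gets appended to out/input.txt.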

>>31586
Better yet, encode image by image, but not in a stupid way:

# Needs: pip install Wand (plus ImageMagick itself and Ghostscript for PDF input)
import io
import pathlib
from typing import List
from wand.image import Image

def render_pages(
    src_pdf: pathlib.Path,
    resolution: int,
) -> List[bytes]:
    pages = []
    with Image() as images:
        # Resolution must be set before the PDF is read/rasterized.
        images.resolution = resolution
        images.read(filename=str(src_pdf))
        images.depth = 8
        images.colorspace = "gray"
        for image in images.sequence:
            # Re-encode each page as JPEG so we upload compact files,
            # not raw pixmaps inflated further by base64.
            compressed = Image(image=image)
            compressed.format = 'jpeg'
            buffer = io.BytesIO()
            compressed.save(file=buffer)
            pages.append(buffer.getvalue())
    return pages

>>31586
I'll admit that I have no idea how to use that code or what it means, but thanks for creating the script to transcribe this! There are a fair number of errors, but not as many as ABBYY would generate - it seems like the LLM is a lot better at cutting out all the extraneous exponents and apostrophes that ABBYY would pick up from the film grain.

The downside of the LLM is that you sometimes get some truly bizarre hallucinations, like this one in the opening sentence of part IV. Here it is typed out by sight:

In the first article we introduced the reader to Comrade William English Walling, the "new" Duehring, who proposes a "new" Socialism based on "new" methods and principles.

And this is what the LLM spat out:

In the first article we introduced the reader to Conrad Williams*. [*Footnote: Not William, as given in the heading. Editors.] Dorothy, who proposes a "new" Socialism based on "new" methods and principles.

To clarify, there are no footnotes. The AI just totally made that up somehow.

>>31590
>no idea how to use that code or what it means
Well, that's a little too bad. I'm not exactly sure how to run it on Windows either. But it sounds like your existing solution is maybe good enough?

>true bizarre hallucinations

Yah, one of my other runs (maybe the one with the weaker model?) got it repeating "to my knowledge" a hundred times, and for one article it monotonically shrank the column width until each line was just a word or a fragment of one.

>>31591
>But it sounds like your existing solution is maybe good enough?
I have dozens of hours of experience using it, which helps a lot. ABBYY also has a lot of PDF image editing tools built in to deskew the text etc., so it looks like I'd be preprocessing through ABBYY anyway before running the images through Tesseract or an LLM. Severely grainy or unevenly exposed images still give it a lot of trouble though, with random apostrophes etc. that are a chore to manually remove.

Thank you again for transcribing the Slavit text for me. It looks good enough that manually correcting it wouldn't be too bad. We'll see about that though; maybe the hallucinations will be so severe that I'd have to verify every sentence.

The ultimate solution to this problem will be to get New York University to take the original crinkly newspapers out of storage and scan the broadsheets properly. The publication definitely deserves it, but it's a huge ask, and I'd like to be able to finish my William English Walling complete works project (plus polemics aimed at him) before I dive into that.

>>31592
Good luck with your endeavors then.

>William English Walling

This seems like a very interesting character and project.

