At work I have been given a task to organize some documents. Please help me, I’m overwhelmed.
TLDR: Manually separate PDFs into individual documents (variable page length), assign each document to a category, and identify the date of the document. Need a fast way to do it. On Windows.
Here is a longer description of my task.
Goals:
- Chop up the bulk PDFs to create 1 PDF per document
- Sort each of them into one of 10 categories
- Identify the date in each document. Include it in the filename. Optimally, insert it into the document itself (e.g. top right hand corner of first page, in the margin so it doesn’t cover anything)
- Would be nice:
  - Rerun the OCR; I think it could be better. At home I would use ocrmypdf
  - Clean up the scans: fix alignment, remove artifacts etc (only if effort is trivial and 0% risk of data loss)
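
To make the date-in-filename goal concrete, here is a minimal Python sketch. It assumes the OCR text of a document's first page is already available as a string. The names `find_date` and `build_filename` and the filename pattern are made up for illustration. Purely numeric dates such as 03/04/2021 are deliberately skipped, because the day/month order cannot be guessed safely.

```python
import re
from datetime import date

# Lookup for spelled-out English month names, e.g. "3 March 2021".
MONTHS = {name: i for i, name in enumerate(
    ["january", "february", "march", "april", "may", "june", "july",
     "august", "september", "october", "november", "december"], start=1)}

ISO = re.compile(r"\b(\d{4})-(\d{1,2})-(\d{1,2})\b")          # 2021-03-03
DMY = re.compile(r"\b(\d{1,2})\s+([A-Za-z]+)\s+(\d{4})\b")    # 3 March 2021
MDY = re.compile(r"\b([A-Za-z]+)\s+(\d{1,2}),?\s+(\d{4})\b")  # March 3, 2021


def find_date(text):
    """Return the first valid date found in text (by position), or None."""
    candidates = []  # tuples of (position, year, month, day)
    for m in ISO.finditer(text):
        candidates.append((m.start(), int(m.group(1)), int(m.group(2)), int(m.group(3))))
    for m in DMY.finditer(text):
        month = MONTHS.get(m.group(2).lower())
        if month:
            candidates.append((m.start(), int(m.group(3)), month, int(m.group(1))))
    for m in MDY.finditer(text):
        month = MONTHS.get(m.group(1).lower())
        if month:
            candidates.append((m.start(), int(m.group(3)), month, int(m.group(2))))
    for _, year, month, day in sorted(candidates):
        try:
            return date(year, month, day)
        except ValueError:
            continue  # e.g. "31 February 2020" from a bad OCR read
    return None


def build_filename(category, doc_date, seq):
    """Hypothetical naming scheme: date first so files sort chronologically."""
    stamp = doc_date.isoformat() if doc_date else "undated"
    return f"{stamp}_{category}_{seq:03d}.pdf"
```

Since the OCR is unreliable, anything this returns would still need a quick look by eye; files that come back as "undated" are the ones to check by hand.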
Inputs:
- Printed out, it’s a stack of papers about 6-8cm tall
- Has been provided in bulk PDFs about 40-60 pages each
- Most individual documents are 2-5 pages in length, with some being 10-20
Document characteristics:
- Scanned on an actual scanner; some carefully, others not
- Have been optimized for small file size
- Mostly black & white, some greyscale
- Business documents, records, official correspondence etc. Typed, not handwritten. English.
- They are all in some sort of standardized format, but from many different sources, each with their own format
- Have had some sort of OCR applied to them; it isn’t very good, especially when the scans aren’t perfect.
Work environment and constraints:
- Because I am at work, I am using a standard Windows workstation that is set up for office (not developer) use.
- I have asked for the full version of Adobe Acrobat to be installed, because they have a license for that. I’ve never used it. Maybe it will do all of this but based on how profoundly annoying Reader is, I am skeptical.
- I can install things that don’t require administrator privileges. I can ask an administrator to install something if I am reasonably confident it’ll be useful and safe, but I can’t be annoying and ask for things all the time. I strongly prefer open source tools.
- Cannot under any circumstances use anything online, cloud, external AI. All data must stay local.
- I might be able to justify using Linux in the future if this is an ongoing task (it might be), so Linux-only suggestions are welcome but won’t be implemented first-line. Desktop applications > self hosted servers or command line.


Am I understanding this right: you have some number of .pdf files that are 40-60 pages each, within them there are documents varying from 2 to 20 pages, and your task is to split them up?
If so, how many .PDFs are we talking here? You said on paper it’s like 8cm high? So like 1000 pages ish?
So like 20 .PDFs?
Just… Do it, dawg. The amount of time you spent on this post you coulda finished 1 of those PDFs. That’s 5% of the task. Do that 19 more times.
If you’re asking “can Adobe Acrobat break a 60 page .pdf into some number of .PDFs based on page numbers I tell it to?” the answer is yes.
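
If Acrobat turns out to be a pain, the same split can be sketched in a few lines of Python. This is only a rough sketch, assuming Python and the open source pypdf library are available (both can normally be installed per-user without admin rights, and everything stays local). The function names, the "1-3, 4-8, 9" range format and the output naming are made up for illustration. You still decide the page ranges yourself by looking at the PDF.

```python
from pathlib import Path


def parse_ranges(spec):
    """Turn '1-3, 4-8, 9' into [(1, 3), (4, 8), (9, 9)] (1-based, inclusive)."""
    ranges = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        first, _, last = part.partition("-")
        start = int(first)
        end = int(last) if last else start
        if start < 1 or end < start:
            raise ValueError(f"bad page range: {part!r}")
        ranges.append((start, end))
    return ranges


def split_pdf(source, spec, out_dir):
    """Write one PDF per page range; returns the list of files written."""
    # Imported here so the range parsing above works without pypdf installed.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(str(source))
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for n, (start, end) in enumerate(parse_ranges(spec), start=1):
        if end > len(reader.pages):
            raise ValueError(f"range {start}-{end} is past the last page")
        writer = PdfWriter()
        for index in range(start - 1, end):  # pypdf pages are 0-based
            writer.add_page(reader.pages[index])
        target = out_dir / f"{Path(source).stem}_doc{n:02d}.pdf"
        with open(target, "wb") as handle:
            writer.write(handle)
        written.append(target)
    return written
```

Something like `split_pdf("bulk01.pdf", "1-3, 4-8, 9-20", "out")` would then write three files into `out`, leaving the original bulk PDF untouched (`bulk01.pdf` is just a placeholder name).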
Breathe a bit friend. No need to get overwhelmed.