

So no, no billion dollar company can make their own training data
This statement brought along with it the terrifying thought that there’s a dystopian alternative timeline where companies do make their own training data, by commissioning untold numbers of scientists, engineers, artists, researchers, and other specialties to undertake work that no one else has. But rather than trying to further the sum of human knowledge, or even directly commercializing the fruits of that research, that it’s all just fodder to throw into the LLM training set. A world where knowledge is not only gatekept like Elsevier but it isn’t even accessible by humans: only the LLM will get to read it and digest it for human consumption.
Written by humans, read by AI, spoonfed to humans. My god, what an awful world that would be.

The photos taken by the sorting machines are of the outside of the envelope, and are necessary in order to perform OCR of the destination address and to verify postage. There is no general mechanism to photograph the contents of mailpieces, and given how enormous the operations of the postal service is, casting a wide surveillance net to capture the contents of mailpieces is simply impractical before someone eventually spilled the beans.
That said, what you describe is a method of investigation known as mail cover, where the useful info from the outside of a recipient’s mail can be useful. For example, getting lots of mail from a huge number domestic addresses in plain envelopes, the sort that victims of remittance fraud would have on hand, could be a sign that the recipient is laundering fraudulent money. Alternatively, sometimes the envelope used by the sender is so thin that the outside photo accidentally reveals the contents. This is no different than holding up an envelope to the sunlight and looking through it. Obvious data is obvious to observe.
In electronic surveillance (a la NSA), looking at just the outside of an envelope is akin to recording only the metadata of an encrypted messaging app. No, you can’t read the messages, but seeing that someone received a 20 MB message could indicate a video, whereas 2 KB might just be one message in a rapid convo.