Post by Andrew
http://domain.com
in a million places (usually at the top, bottom or middle of a page that is
mostly empty), where all I want to do is delete it completely.
I want to delete those links, and the only PDF editor I know of that will
delete them easily is the Adobe Acrobat (writer) but it deletes them one by
one. Yuck. I'm doing that, but is there a better way?
Googling, I find that Calibre will delete them, but oh my god, is that a
complicated action, where you have to do CSS rules and crazy stuff like that.
You can't just search and replace, for some godforsaken reason.
Hence I implore you for help... The PDF can easily be converted to
any epub format, if there's a way to do it other than with a PDF editor.
PDF files are normally "binary" in appearance. But they can be
translated to "ascii". Notice there is a gubbin near the top, which
is not ASCII, and that continues to make the file binary. For example,
some scripting you might do, might have an issue with the four binary
characters. (That binary thing, could be different on a different
version of PDF file.)
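
If you want to see that binary gubbin for yourself, a small sketch like
this one would do. It only looks at the first two lines of the file. The
function name is made up for illustration, and it assumes the comment line
with the high-bit bytes sits right after the %PDF- line, which is common
but not guaranteed.

```python
def inspect_pdf_header(data: bytes):
    """Return (version, high_bit_bytes) taken from the first two lines of a PDF."""
    # Only the start of the file matters here; no need to split the whole thing.
    lines = data[:1024].splitlines()
    if not lines or not lines[0].startswith(b"%PDF-"):
        raise ValueError("file does not start with a %PDF- header")
    version = lines[0][len(b"%PDF-"):].decode("ascii").strip()
    marker = b""
    # The second line is usually a comment holding a few bytes >= 0x80,
    # there so that tools treat the file as binary.
    if len(lines) > 1 and lines[1].startswith(b"%"):
        marker = bytes(b for b in lines[1] if b >= 0x80)
    return version, marker
```

You would feed it the result of open("input.pdf", "rb").read(). Anything
that comes back in the second value is what a text-mode script could
choke on.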
I don't know if this file has integrity or not. It's just
intended to show how simple the format could have been. (Normal files
will NOT be simple, so you can forget that right now.)
*********************** PDF in Text Mode ***********************
%PDF-1.1
%¥±ë
1 0 obj
<< /Type /Catalog
/Pages 2 0 R
>>
endobj
2 0 obj
<< /Type /Pages
/Kids [3 0 R]
/Count 1
/MediaBox [0 0 300 144]
>>
endobj
3 0 obj
<< /Type /Page
/Parent 2 0 R
/Resources
<< /Font
<< /F1
<< /Type /Font
/Subtype /Type1
/BaseFont /Times-Roman
>>
>>
>>
/Contents 4 0 R
>>
endobj
4 0 obj
<< /Length 55 >>
stream
BT
/F1 18 Tf
0 0 Td
(Hello World) Tj
ET
endstream
endobj
xref
0 5
0000000000 65535 f
0000000018 00000 n
0000000077 00000 n
0000000178 00000 n
0000000457 00000 n
trailer
<< /Root 1 0 R
/Size 5
>>
startxref
565
%%EOF
*********************** PDF in Text Mode ***********************
If you just delete the string in question, the reader is going to say
"this file is damaged".
The document has consistency checks (the /Length count on each stream, and
the byte offsets in the xref table), and that's how it can
tell the file has been edited.
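
To make the offset check concrete, here is a rough sketch that does what
a reader does with the xref table: follow startxref, read the entries, and
see whether each offset really lands on an "N G obj" line. The function
name is my own invention. It assumes an old-style xref table with a single
subsection, like the sample above. Newer files with xref streams or
incremental updates will not work with it.

```python
def check_xref(data: bytes) -> list:
    """Return object numbers whose xref offset does not point at 'N G obj'."""
    start = data.rfind(b"startxref")
    if start < 0:
        raise ValueError("no startxref found")
    # The number after 'startxref' is the byte offset of the xref table.
    xref_pos = int(data[start + len(b"startxref"):].split()[0])
    if data[xref_pos:xref_pos + 4] != b"xref":
        raise ValueError("startxref does not point at an xref table")
    lines = data[xref_pos:].splitlines()
    # Line after 'xref' is '<first object number> <entry count>'.
    first, count = (int(x) for x in lines[1].split())
    bad = []
    for i in range(count):
        offset, gen, kind = lines[2 + i].split()
        if kind != b"n":
            continue  # 'f' marks a free entry, nothing to check
        expected = b"%d %d obj" % (first + i, int(gen))
        if not data.startswith(expected, int(offset)):
            bad.append(first + i)
    return bad
```

Delete even a few bytes from the middle of a file and everything after
that point shifts, so either the startxref test fails or the list of bad
objects fills up. That is the "damaged" message in a nutshell.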
You can tell from this, they were just screwing with us. The
format before this, PostScript, didn't have counters. When you
found a section in PostScript that said "Do not delete this section",
you just deleted it :-) Well, when they invented PDF, they messed
with it a bit, in the bomb-squad sense.
Adobe makes a "book" available about the PDF standard, and
you could use that. But that's a learning experience.
The only command of note, in my Notes file, is this, and I have
not placed any comments to tell me what it does :-) This makes
the ASCII-like flavor of file.
mutool.exe convert -F pdf -O decompress,clean -o output.pdf input.pdf
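
If you have a pile of files to do, the same command could be driven from
a script. This is only a wrapper around the exact command line above. The
function names are hypothetical, and it assumes the executable is on your
PATH as "mutool" (pass exe="mutool.exe" or a full path otherwise).

```python
import subprocess

def mutool_decompress_cmd(src, dst, exe="mutool"):
    """Build the argument list for the command line shown above."""
    return [exe, "convert", "-F", "pdf", "-O", "decompress,clean", "-o", dst, src]

def decompress_pdf(src, dst, exe="mutool"):
    """Run mutool and raise CalledProcessError if it exits non-zero."""
    subprocess.run(mutool_decompress_cmd(src, dst, exe), check=True)
```

Calling decompress_pdf("input.pdf", "output.pdf") should leave the
original alone and write the ASCII-like copy next to it.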
And when we talk of "binary to ascii", there is DEFINITELY binary
still in there. The commercial fonts can be encoded somehow, and they are
still transferred as a binary blob. If not handled properly, you will break
the fonts. This puts some constraints on how you work on the file, for sure.
I could use HxD (a hex editor) to make the edit, for example, while keeping
another tool open to read the ASCII portion of the file more easily.
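
What you would do by hand in a hex editor could be sketched in a script
like this: overwrite the string with the same number of filler bytes, so
nothing moves and the counters and offsets stay valid. It works on raw
bytes, never text mode, so the binary blobs are left untouched. The
function name is made up, and it assumes the string appears literally in
the decompressed file, which is not guaranteed.

```python
def blank_out(data: bytes, needle: bytes, fill: bytes = b" "):
    """Overwrite each literal occurrence of needle with same-length filler.

    Returns (patched_bytes, number_of_occurrences).
    """
    if len(fill) != 1:
        raise ValueError("fill must be exactly one byte")
    if not needle:
        raise ValueError("needle must not be empty")
    count = data.count(needle)
    patched = data.replace(needle, fill * len(needle))
    # Same length in, same length out: no byte offsets have moved.
    assert len(patched) == len(data)
    return patched, count
```

Read the file with open(path, "rb") and write the result with
open(path, "wb"). If the count comes back as zero, the string is stored
some other way and this trick does not apply. Work on a copy.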
There are various ways to obscure text in the document. Even in
"ASCII mode", nothing says you will see "https://www.something.com".
You might see bunches of numbers instead. If this string of yours
is intended as a watermark, then of course the file will be augmented
for maximum annoyance. A lot of the watermarks we played with as kids,
they were not hardened. You might have concluded nobody cared to do
a good job. I can assure you that some commercial tools definitely
take their watermark design seriously.
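
As one small example of the "bunches of numbers" case: PDF allows a string
to be written in hex between angle brackets instead of as text between
parentheses. A sketch like the following counts both forms. The function
names are my own, and it only covers the plain literal and plain hex
spellings. Hex strings with embedded whitespace, font-specific encodings,
or text drawn one glyph at a time will all slip past it.

```python
def pdf_string_forms(text: str) -> dict:
    """Return a few byte spellings under which text might appear in a PDF."""
    raw = text.encode("latin-1")
    return {
        "literal": raw,                                  # (http://...) style
        "hex_upper": raw.hex().upper().encode("ascii"),  # <687474...> style
        "hex_lower": raw.hex().encode("ascii"),
    }

def find_forms(data: bytes, text: str) -> dict:
    """Count how often each spelling of text occurs in the raw file bytes."""
    return {name: data.count(form) for name, form in pdf_string_forms(text).items()}
```

If every count is zero on a decompressed file, the string is being hidden
by something more deliberate than this.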
Paul