How to remove a link in a PDF that is found in a thousand pages

Discussion:

Andrew

2024-05-24 00:04:52 UTC

I have a PDF with a link in it of the form:
http://domain.com
in a million places (usually at the top, bottom or middle of a page that is
mostly empty), and all I want to do is delete it completely.

I want to delete those links, and the only PDF editor I know of that will
delete them easily is Adobe Acrobat (the writer), but it deletes them one by
one. Yuck. I'm doing that, but is there a better way?

Googling, I find that Calibre will delete them but oh my god, is that a
complicated action, where you have to do CSS rules and crazy stuff like that.

You can't just search and replace for some godforsaken reason.

Hence I implore you for help. The PDF can easily be converted to any epub
format, if there's a way to do this other than with a PDF editor.

Paul

2024-05-24 03:00:28 UTC

PDF files are normally "binary" in appearance, but they can be
translated to "ascii". Notice there is a gubbin near the top which
is not ASCII, and which continues to make the file binary. For example,
some scripting you might do could have an issue with the four binary
characters. (That binary thing could be different in a different
version of PDF file.)

I don't know if this file has integrity or not. It's just
intended to show how simple the format could have been. (Normal files
will NOT be simple, so you can forget that right now.)

*********************** PDF in Text Mode ***********************
%PDF-1.1
%¥±ë

1 0 obj
<< /Type /Catalog
/Pages 2 0 R
>>
endobj

2 0 obj
<< /Type /Pages
/Kids [3 0 R]
/Count 1
/MediaBox [0 0 300 144]
>>
endobj

3 0 obj
<< /Type /Page
/Parent 2 0 R
/Resources
<< /Font
<< /F1
<< /Type /Font
/Subtype /Type1
/BaseFont /Times-Roman
>>
>>
>>
/Contents 4 0 R
>>
endobj

4 0 obj
<< /Length 55 >>
stream
BT
/F1 18 Tf
0 0 Td
(Hello World) Tj
ET
endstream
endobj

xref
0 5
0000000000 65535 f
0000000018 00000 n
0000000077 00000 n
0000000178 00000 n
0000000457 00000 n
trailer
<< /Root 1 0 R
/Size 5
>>
startxref
565
%%EOF
*********************** PDF in Text Mode ***********************

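If you just delete the string in question, the PDF reader is going to say
"this file is damaged".

The document has consistency checks (the /Length of each stream, the byte
offsets in the xref table, and the startxref value), and that's how a reader
can tell the file has been edited.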

You can tell from this, they were just screwing with us. The
format before this, PostScript, didn't have counters. When you
found a section in PostScript that said "Do not delete this section",
you just deleted it :-) Well, when they invented PDF, they messed
with it a bit, in the bomb-squad sense.
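
As a rough illustration of those counters, here is a minimal Python sketch that checks whether each offset in a classic xref table still points at the start of its object. The function name xref_offsets_ok is made up for this example, and it assumes a single-subsection xref table like the one shown above (no xref streams, no incremental updates beyond taking the last startxref).

```python
import re


def xref_offsets_ok(data: bytes) -> dict[int, bool]:
    """Map object number -> True if its xref offset points at 'N G obj'.

    Sketch only: handles a classic xref table with one subsection.
    """
    # The number after the last 'startxref' is the byte offset of 'xref'.
    matches = re.findall(rb"startxref\s+(\d+)", data)
    if not matches:
        raise ValueError("no startxref found")
    table = data[int(matches[-1]):]

    # Subsection header: first object number and number of entries.
    header = re.match(rb"xref\s+(\d+)\s+(\d+)\s+", table)
    if header is None:
        raise ValueError("startxref does not point at an xref table")
    first, count = int(header.group(1)), int(header.group(2))

    # Each entry: 10-digit offset, 5-digit generation, 'n' (in use) or 'f' (free).
    entries = re.findall(rb"(\d{10}) (\d{5}) ([nf])", table[header.end():])[:count]

    result = {}
    for num, (offset, gen, kind) in enumerate(entries, start=first):
        if kind == b"n":
            expected = b"%d %d obj" % (num, int(gen))
            result[num] = data.startswith(expected, int(offset))
    return result
```

Deleting even a few bytes from the middle of such a file shifts everything after the deletion, so the offsets (and usually startxref itself) no longer line up. A same-length replacement leaves them intact.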

Adobe makes a "book" available about the PDF standard, and
you could use that. But that's a learning experience.

The only command of note, in my Notes file, is this, and I have
not placed any comments to tell me what it does :-) This makes
the ASCII-like flavor of file.

mutool.exe convert -F pdf -O decompress,clean -o output.pdf input.pdf

And when we talk of "binary to ascii", there is DEFINITELY binary
still in there. The commercial fonts can be encoded somehow, and they are
still transferred as a binary blob. If not handled properly, you will break
the fonts. This puts some constraints on how you work on the file, for sure.
I could use HxD for example, while keeping another tool open to make the
ASCII portion of the file easier to read.

There are various ways to obscure text in the document. Even in
"ASCII mode", nothing says you will see "https://www.something.com".
You might see bunches of numbers instead. If this string of yours
is intended as a watermark, then of course the file will be augmented
for maximum annoyance. A lot of the watermarks we played with as kids,
they were not hardened. You might have concluded nobody cared to do
a good job. I can assure you that some commercial tools, definitely
take their watermark design seriously.
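
To make the "bunches of numbers" point concrete, here is a small Python sketch that counts how often a URL appears in a decompressed PDF either as plain bytes or as a PDF hex string. The names encoded_forms and count_forms are invented for this example. It is only a first probe: text drawn with a subset font and a custom encoding, or split across several string fragments, will match none of these patterns.

```python
def encoded_forms(text: str) -> dict[str, bytes]:
    """Byte patterns under which an ASCII string may appear in a PDF."""
    raw = text.encode("ascii")
    hex_digits = raw.hex()  # lowercase hex digits, two per byte
    return {
        "literal": raw,                                 # as in (http://...) Tj
        "hex_upper": hex_digits.upper().encode("ascii"),  # as in <68747470...> Tj
        "hex_lower": hex_digits.encode("ascii"),
    }


def count_forms(data: bytes, text: str) -> dict[str, int]:
    """Count occurrences of each encoded form of `text` in `data`."""
    return {name: data.count(pattern) for name, pattern in encoded_forms(text).items()}
```

If every count comes back zero on the decompressed file, the string is probably stored in one of the less friendly ways described above.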

Paul

Lawrence D'Oliveiro

2024-05-24 03:45:51 UTC

... is there a better way?

Write a program using a PDF-manipulation toolkit.

I have had good results writing Python code using pikepdf
<https://github.com/pikepdf/pikepdf>.
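
A minimal sketch of what such a program might look like, assuming the links are stored as ordinary /Link annotations with a /URI action. This is not the poster's code; the function names and file names are made up for illustration. It removes only the clickable annotation, so if the URL is also drawn as visible text on the page, that text stays and needs separate handling.

```python
from typing import Optional


def is_target_link(subtype: str, uri: Optional[str], url: str) -> bool:
    """True for a /Link annotation whose URI starts with `url`."""
    return subtype == "/Link" and uri is not None and uri.startswith(url)


def strip_link_annotations(src: str, dst: str, url: str) -> int:
    """Remove matching link annotations from every page; return the count."""
    import pikepdf  # third-party package: pip install pikepdf

    removed = 0
    with pikepdf.open(src) as pdf:
        for page in pdf.pages:
            annots = page.obj.get("/Annots")
            if annots is None:
                continue
            # Walk backwards so a deletion does not shift entries still to visit.
            for i in reversed(range(len(annots))):
                annot = annots[i]
                action = annot.get("/A")
                uri = action.get("/URI") if action is not None else None
                if is_target_link(
                    str(annot.get("/Subtype")),
                    None if uri is None else str(uri),
                    url,
                ):
                    del annots[i]
                    removed += 1
        pdf.save(dst)
    return removed
```

Usage would be something like strip_link_annotations("input.pdf", "output.pdf", "http://domain.com"), run on a copy of the file.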

Kingfisher

2024-05-24 06:00:35 UTC

LibreOffice will open a PDF (by default in Draw rather than Writer), let you
edit it, and export it as PDF again. Its Find and Replace function can get
all the links in one shot.

Herbert Kleebauer

2024-05-24 06:40:21 UTC

If you have Acrobat, save the file as uncompressed PDF. If you are
lucky, you will find "http://domain.com" as simple text in the file.
Replace every occurrence with exactly the same number of blanks, so that
the file length and the internal byte offsets do not change. But
you have to use an editor which preserves the few binary bytes at
the beginning of the file.
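
A short Python sketch of that same-length replacement, working on raw bytes so the binary marker near the top of the file is left untouched. The function and file names are hypothetical, and it only helps if the URL really is present as plain text in the uncompressed file.

```python
from pathlib import Path


def blank_out(data: bytes, target: bytes) -> tuple[bytes, int]:
    """Overwrite every occurrence of `target` with spaces of equal length."""
    count = data.count(target)
    # Same length in, same length out: byte offsets elsewhere stay valid.
    return data.replace(target, b" " * len(target)), count


def blank_out_file(src: str, dst: str, target: bytes) -> int:
    """Read `src` as bytes, blank out `target`, write `dst`; return the count."""
    data, count = blank_out(Path(src).read_bytes(), target)
    Path(dst).write_bytes(data)
    return count
```

For example, blank_out_file("uncompressed.pdf", "cleaned.pdf", b"http://domain.com") returns the number of occurrences it overwrote. A count of zero means the string is not stored as plain text.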

Peter Johnson

2024-05-24 15:04:00 UTC

How important is the formatting?
You could extract the text into a Word (or similar) file, run
find/replace on it and then create a new PDF. That might or might
not change the formatting, but you could probably fix that before you
create the new PDF.

Peter Flynn

2024-05-26 21:04:13 UTC

I have in the past had good success by converting the document to
PostScript, finding the pattern of the offending links, and running a
stream editor, then converting back to PDF, e.g.

pdf2ps foo.pdf - | sed -e 's+http://domain\.com++g' | ps2pdf - foo2.pdf

Peter