Automating an atypical search & replace

Richard Owlett

2024-07-13 16:08:48 UTC

I'm reformatting some HTML files containing chapters of the KJV Bible.
My source follows the practice of italicizing some words.
I find italics distracting.

These occurrences are consistently of the form
<span class='add'>arbitrary_text</span>

I wish to delete "<span class='add'>" and *ASSOCIATED* "</span>".
Obviously it would not be wise to fully automate the action.
I wish to find all occurrences of
<span class='add'>arbitrary_text</span> and manually confirm the edit.

In general, is it feasible?
Can KDE's Kate do it?

TIA

Janis Papanagnou

2024-07-13 17:48:57 UTC

Post by Richard Owlett
I'm reformatting some HTML files containing chapters of the KJV Bible.
My source follows the practice of italicizing some words.
I find italics distracting.
These occurrences are consistently of the form
<span class='add'>arbitrary_text</span>
I wish to delete "<span class='add'>" and *ASSOCIATED* "</span>".
Obviously it would not be wise to fully automate the action.
I wish to find all occurrences of
<span class='add'>arbitrary_text</span> and manually confirm the edit.
In general, is it feasible?

Yes, sure.

Some remarks...
I would use Regular Expressions (RE) for that task.
If <span> sections can be nested in your HTML source then you
cannot do that with plain RE processors.
Since you want to inspect each <span> pattern individually it's
not clear what you mean by "automate" (which I'd interpret as
running a batch job to do the process).
Actually you seem to want a sequential find + replace-or-skip.

In Vim I'd search for the "<span ..." pattern and then delete
to the next "</span>" pattern. (Assuming no nested <span>.)
Rinse repeat.
That could be (for example) the commands [case 1]

/<span class='add'>
d/<\/span>df>

If there's no other <...> inside the span-sections you could
simplify that to [case 2]

/<span class='add'>
d2f>

with the opportunity to repeat those search+delete commands
by simply typing n. for every match, like n.n.n.n. or if
you want to skip some like, e.g., n.nnnn.n.nnn.n

With n you get to the next span pattern and . repeats the
last command.

In [case 1] the repeat isn't possible since we have two delete
operations d/<\/span> and df> , but here you can define
macros to trigger the command by a keystroke or just use the
recording function to repeat the once recorded commands.

Sounds complicated? - Maybe. - But if we know your exact data
format we can provide the Vim command sequence that is
easiest to use.

Post by Richard Owlett
Can KDE's Kate do it?

Don't know.

Janis

Janis Papanagnou

2024-07-13 17:55:35 UTC

Please ignore my previous post - it would delete the whole span'ed
section!

It just occurred to me you'd probably want something like

/<span class='add'>
df>
/<\/span>
df>

And if you're using recording of the commands (I'll provide code
on demand) just repeat the recordings. You can also just use the
arrow keys after typing / to get the previous search patterns
if you like.
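
For anyone who would rather script the same idea than type it in Vim, here is a minimal sketch in Python. It mirrors the two steps of the commands above: find the opening tag and drop it, then find the next closing tag and drop that, leaving the text in between. The function name strip_add_spans, the confirm callback and the sample line are made up for illustration, and like the Vim version it assumes the spans are not nested.

```python
OPEN_TAG = "<span class='add'>"
CLOSE_TAG = "</span>"


def strip_add_spans(text, confirm):
    """Drop OPEN_TAG and the next CLOSE_TAG after it wherever confirm() agrees.

    Assumption: spans are not nested, so the next CLOSE_TAG is the right one.
    confirm(inner) receives the text between the tags and returns True or False.
    """
    out = []
    pos = 0
    while True:
        start = text.find(OPEN_TAG, pos)
        if start == -1:
            break
        inner_start = start + len(OPEN_TAG)
        end = text.find(CLOSE_TAG, inner_start)
        if end == -1:
            break  # opening tag without a closing tag: leave the rest alone
        inner = text[inner_start:end]
        after = end + len(CLOSE_TAG)
        out.append(text[pos:start])
        if confirm(inner):
            out.append(inner)              # keep the words, drop both tags
        else:
            out.append(text[start:after])  # skipped: keep the span unchanged
        pos = after
    out.append(text[pos:])
    return "".join(out)


# Made-up sample line; here every match is accepted.
sample = "and darkness <span class='add'>was</span> upon the face of the deep"
print(strip_add_spans(sample, lambda inner: True))
```

To be asked about each match, the confirm argument could be a small function that shows the inner text and reads y or n with input().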

Richard Owlett

2024-07-14 07:33:25 UTC

Post by Janis Papanagnou
Please ignore my previous post - it would delete the whole span'ed
section!
It just occurred to me you'd probably want something like
/<span class='add'>
df>
/<\/span>
df>
And if you're using recording of the commands (I'll provide code
on demand) just repeat the recordings. You can also just use the
arrow keys after typing / to get the previous search patterns
if you like.

I don't know how to parse your answer.
But I suspect following some leads from Lawrence and Stan in this thread
will be illuminating. I have just started reading
https://docs.kde.org/stable5/en/kate/katepart/regular-expressions.html .

Part of my motive for this project is self education.

Janis Papanagnou

2024-07-14 08:43:48 UTC

Post by Richard Owlett

Post by Janis Papanagnou
Please ignore my previous post - it would delete the whole span'ed
section!
It just occurred to me you'd probably want something like
/<span class='add'>
df>
/<\/span>
df>
And if you're using recording of the commands (I'll provide code
on demand) just repeat the recordings. You can also just use the
arrow keys after typing / to get the previous search patterns
if you like.

I don't know how to parse your answer.

What I meant is that if you're doing some editing tasks or editing
commands repeatedly you certainly want to avoid typing them over
and over again. There are a couple of methods to achieve that in the Vim
editor. One method is using the editor's history functions that
make it possible to access (for example) previous search patterns.
Another one in Vim is to record the commands to be able to replay
them whenever you want with simple keystrokes.

The maybe cryptic appearing commands I gave are the Vim commands
for the task you had described:

/ searches for the regular expression pattern following
df> deletes the text up to the tag-terminating '>' symbol

Post by Richard Owlett
But I suspect following some leads from Lawrence and Stan in this thread
will be illuminating. I have just started reading
https://docs.kde.org/stable5/en/kate/katepart/regular-expressions.html .
Part of my motive for this project is self education.

Fair enough. It's not clear to me what exactly you want to learn.
Using the Kate editor, learning how to write Regular Expressions,
how to efficiently edit texts, or how to handle/edit HTML files
to make them readable for your purposes?

If it's the latter then the right way to do that is (as already
said in my [OT] reply or as also Stan suggested) to just fix the
CSS definition, if that's the place where the 'italic' property
had been defined. (If, OTOH, your HTML code contains, e.g. lots
of <i> tags then you'd have to handle/edit them individually.)

It has also been mentioned already that HTML structures cannot
sensibly be handled by regular expressions. - So you learned that
already. - But for non-nested HTML sub-structures it could be
achievable anyway.
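
Whether a given file really is free of nesting can be checked before any editing. The following is only a sketch in Python, with made-up names (survey, OPEN_TAG, SIMPLE_SPAN): it counts all opening tags of the 'add' class and compares that with the number of spans that contain no other markup. If the two numbers are equal, a simple pattern should cover every case. If they differ, some spans hold other tags and need a look by hand.

```python
import re

OPEN_TAG = "<span class='add'>"
# A "simple" span has no '<' at all between its opening and closing tag.
SIMPLE_SPAN = re.compile(r"<span class='add'>[^<]*</span>")


def survey(text):
    """Return (number of opening tags, number of spans without inner markup)."""
    total = text.count(OPEN_TAG)
    simple = len(SIMPLE_SPAN.findall(text))
    return total, simple


# Made-up sample: one plain span and one that contains an <i> element.
sample = "<span class='add'>is</span> good, <span class='add'>it <i>was</i></span> so"
total, simple = survey(sample)
print(total, "opening tags,", simple, "of them simple")
```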

To learn the Kate editor I'd suppose there's a description or
manual available.

Janis

Janis Papanagnou

2024-07-13 19:18:01 UTC

Post by Richard Owlett
I'm reformatting some HTML files containing chapters of the KJV Bible.
My source follows the practice of italicizing some words.
I find italics distracting.
These occurrences are consistently of the form
<span class='add'>arbitrary_text</span>

It just occurred to me that if you say the italic text entities are
the text objects in this span clause then the italic text-decoration
is likely defined as a CSS attribute of the respective CSS class.
That would in your example mean the class "add". Since you generally
don't seem to like italics it would be easier - and also the usual
way to tackle such a text - to change the single CSS attribute of
the class. You find it in the CSS section of the header file or in
a file with the CSS definition that is referenced in the HTML file.
Look out for a line like "font-style: italic;" and remove that.

Janis

Lawrence D'Oliveiro

2024-07-13 23:39:14 UTC

Post by Richard Owlett
These occurrences are consistently of the form
<span class='add'>arbitrary_text</span>
I wish to delete "<span class='add'>" and *ASSOCIATED* "</span>".

This is beyond the abilities of regular expressions. This is the point
where you need to use an actual HTML/XML-parsing library.

See also
<https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags>.
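
As a rough illustration of the parser route, here is a minimal sketch using Python's standard html.parser module. The class and function names (AddSpanUnwrapper, unwrap_add_spans) are invented for this example. It keeps a stack of open span elements, so a closing tag is paired with the right opening tag even when spans are nested. It is not a complete HTML rewriter: closing tags are re-emitted in lower case, processing instructions are dropped, and only an exact class value of "add" is recognised.

```python
from html.parser import HTMLParser


class AddSpanUnwrapper(HTMLParser):
    """Re-emit the input, leaving out <span class='add'> and its matching </span>."""

    def __init__(self):
        super().__init__(convert_charrefs=False)  # keep &amp; etc. as written
        self.out = []
        self.span_stack = []  # one entry per open span: True if it is being dropped

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            drop = dict(attrs).get("class") == "add"
            self.span_stack.append(drop)
            if drop:
                return
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())  # e.g. <br/>

    def handle_endtag(self, tag):
        if tag == "span" and self.span_stack:
            if self.span_stack.pop():
                return  # this </span> belongs to a dropped opening tag
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def unwrap_add_spans(html):
    parser = AddSpanUnwrapper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


# Made-up sample with a span nested inside an 'add' span.
nested = "<p>that <span class='add'>it <span class='x'>was</span> so</span> &amp; good</p>"
print(unwrap_add_spans(nested))
```

This version changes every match without asking, so it does not by itself meet the wish to confirm each edit. A confirmation step would have to be added around the drop decision.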

Stan Brown

2024-07-14 06:13:55 UTC

Post by Lawrence D'Oliveiro

Post by Richard Owlett
These occurrences are consistently of the form
<span class='add'>arbitrary_text</span>
I wish to delete "<span class='add'>" and *ASSOCIATED* "</span>".

This is beyond the abilities of regular expressions. This is the point
where you need to use an actual HTML/XML-parsing library.

In general I'd agree with you. But the OP made a big deal -- in a
different thread, for some reason -- about wanting to use minimal
HTML, so I doubt very much there will be nested <span> ... </span>
sequences.

Also, the OP quite rightly wanted to confirm each change before it is
made, so presumably if there are any nested sequences he will say no
to that particular edit and fix it manually.

--
Stan Brown, Tehachapi, California, USA https://BrownMath.com/
Shikata ga nai...

Richard Owlett

2024-07-14 07:51:45 UTC

Post by Stan Brown

Post by Lawrence D'Oliveiro

Post by Richard Owlett
These occurrences are consistently of the form
<span class='add'>arbitrary_text</span>
I wish to delete "<span class='add'>" and *ASSOCIATED* "</span>".

This is beyond the abilities of regular expressions. This is the point
where you need to use an actual HTML/XML-parsing library.

In general I'd agree with you. But the OP made a big deal -- in a
different thread, for some reason -- about wanting to use minimal
HTML, so I doubt very much there will be nested <span> ... </span>
sequences.

I'd compare using a minimal HTML to learning to crawl before pursuing
running a marathon ;}

Post by Stan Brown
Also, the OP quite rightly wanted to confirm each change before it is
made, so presumably if there are any nested sequences he will say no
to that particular edit and fix it manually.

Richard Owlett

2024-07-14 07:47:02 UTC

Post by Lawrence D'Oliveiro

Post by Richard Owlett
These occurrences are consistently of the form
<span class='add'>arbitrary_text</span>
I wish to delete "<span class='add'>" and *ASSOCIATED* "</span>".

This is beyond the abilities of regular expressions. This is the point
where you need to use an actual HTML/XML-parsing library.
See also
<https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags>.

Thank you for the reference. Also I've begun perusing
https://docs.kde.org/stable5/en/kate/katepart/regular-expressions.html .
One of my motivations for this project is education.

Stan Brown

2024-07-14 06:08:54 UTC

Post by Richard Owlett
I'm reformatting some HTML files containing chapters of the KJV Bible.
My source follows the practice of italicizing some words.
I find italics distracting.
These occurrences are consistently of the form
<span class='add'>arbitrary_text</span>
I wish to delete "<span class='add'>" and *ASSOCIATED* "</span>".
Obviously it would not be wise to fully automate the action.
I wish to find all occurrences of
<span class='add'>arbitrary_text</span> and manually confirm the edit.
In general, is it feasible?

Yes, of course. Any editor above the level of Notepad ought to be
able to do this. (Sadly, a lot of editors are not above the level of
Notepad.)

For instance, in Vim you would use this command after opening the
file:

:%s;<span class='add'>\([^<]*\)</span>;\1;gc

% = process every line of the file
\( ... \) makes that part of the pattern match addressable
[^<]* matches a string of characters not including a <. If there is
other HTML between span and /span, it will not match.
\1 = the text found between span and /span
gc = do every occurrence on each line, but confirm each one
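
The same pattern can be tried outside Vim. Below is a minimal sketch in Python, where the grouping parentheses are written without backslashes. The names ADD_SPAN and replace_with_confirm and the confirm callback are invented for this example. The callback plays the role of the c flag, and re.sub already handles every occurrence, which corresponds to the g flag.

```python
import re

# Python spelling of the Vim pattern: ( ) instead of \( \).
ADD_SPAN = re.compile(r"<span class='add'>([^<]*)</span>")


def replace_with_confirm(text, confirm):
    """Replace each match by its inner text if confirm(match) returns True."""
    def decide(match):
        # match.group(1) is what \1 refers to in the Vim command;
        # match.group(0) is the whole span, returned unchanged on "no".
        return match.group(1) if confirm(match) else match.group(0)
    return ADD_SPAN.sub(decide, text)


# Made-up sample: the second span contains other HTML, so it does not match.
sample = "<span class='add'>is</span> and <span class='add'>a <b>x</b></span>"
print(replace_with_confirm(sample, lambda m: True))
```

One difference to keep in mind: in Python [^<]* also matches line breaks, while in Vim a collection like [^<] does not match the end of a line unless it is written as \_[^<].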

Post by Richard Owlett
Can KDE's Kate do it?

I've no idea.

But there's an easier solution. Change the definition of class add in
your style sheet:

span.add { font-style:normal; }

Then you won't have to edit the HTML at all.

--
Stan Brown, Tehachapi, California, USA https://BrownMath.com/
Shikata ga nai...

Richard Owlett

2024-07-14 08:02:12 UTC

Post by Stan Brown

Post by Richard Owlett
I'm reformatting some HTML files containing chapters of the KJV Bible.
My source follows the practice of italicizing some words.
I find italics distracting.
These occurrences are consistently of the form
<span class='add'>arbitrary_text</span>
I wish to delete "<span class='add'>" and *ASSOCIATED* "</span>".
Obviously it would not be wise to fully automate the action.
I wish to find all occurrences of
<span class='add'>arbitrary_text</span> and manually confirm the edit.
In general, is it feasible?

Yes, of course. Any editor above the level of Notepad ought to be
able to do this. (Sadly, a lot of editors are not above the level of
Notepad.)
For instance, in Vim you would use this command after opening the file:
:%s;<span class='add'>\([^<]*\)</span>;\1;gc
% = process every line of the file
\( ... \) makes that part of the pattern match addressable
[^<]* matches a string of characters not including a <. If there is
other HTML between span and /span, it will not match.
\1 = the text found between span and /span
gc = do every occurrence on each line, but confirm each one

I'll use parsing that expression as a guide to understanding
https://docs.kde.org/stable5/en/kate/katepart/regular-expressions.html .

Post by Stan Brown

Post by Richard Owlett
Can KDE's Kate do it?

I've no idea.

I'm gaining an appreciation of just how much HTML Kate can handle.
Its highlighting feature begins to serve for minimal syntax checking.

Post by Stan Brown
But there's an easier solution. Change the definition of class add in
your style sheet:
span.add { font-style:normal; }
Then you won't have to edit the HTML at all.

Learning CSS is beyond my current goals.

Lawrence D'Oliveiro

2024-07-14 21:15:44 UTC

Post by Richard Owlett
Learning CSS is beyond my current goals.

CSS is essentially an indispensable part of HTML at this point. If it
saves you effort, why not use it?

Richard Owlett

2024-07-14 21:48:26 UTC

Post by Lawrence D'Oliveiro

Post by Richard Owlett
Learning CSS is beyond my current goals.

CSS is essentially an indispensable part of HTML at this point. If it
saves you effort, why not use it?

At 80 I pursue what's interesting ;}
When I set personal goals for the spec of my project I decided on
doing it in as small as possible a sub-set of HTML 2.0.

Lawrence D'Oliveiro

2024-07-15 01:25:30 UTC

Post by Richard Owlett

Post by Lawrence D'Oliveiro

Post by Richard Owlett
Learning CSS is beyond my current goals.

CSS is essentially an indispensable part of HTML at this point. If it
saves you effort, why not use it?

At 80 I pursue what's interesting ;}
When I set personal goals for the spec of my project I decided on
doing it in as small as possible a sub-set of HTML 2.0.

To me, that’s like spending your weekends rebuilding a Morris Minor.

Richard Owlett

2024-07-15 04:29:07 UTC

Post by Lawrence D'Oliveiro

Post by Richard Owlett

Post by Lawrence D'Oliveiro

Post by Richard Owlett
Learning CSS is beyond my current goals.

CSS is essentially an indispensable part of HTML at this point. If it
saves you effort, why not use it?

At 80 I pursue what's interesting ;}
When I set personal goals for the spec of my project I decided on
doing it in as small as possible a sub-set of HTML 2.0.

To me, that’s like spending your weekends rebuilding a Morris Minor.

Though I've never seen one, if I were mechanically inclined and an ocean
away I could see that.
q.v. https://www.mmoc.org.uk/ says "The MMOC exists to unite these
people who have a fondness of these loveable jellymoulds, and those
people who still use them as everyday transport."

There is even a doctoral thesis on knowledge for its own sake :}!
https://academiccommons.columbia.edu/doi/10.7916/d8-eme0-my23

candycanearter07

2024-07-15 15:30:06 UTC

Post by Lawrence D'Oliveiro

Post by Richard Owlett
Learning CSS is beyond my current goals.

CSS is essentially an indispensable part of HTML at this point. If it
saves you effort, why not use it?

It is kinda hard for me to get a good looking website up..

--
user <candycane> is generated from /dev/urandom

Lawrence D'Oliveiro

2024-07-15 21:59:36 UTC

Post by candycanearter07

Post by Lawrence D'Oliveiro
CSS is essentially an indispensable part of HTML at this point. If it
saves you effort, why not use it?

It is kinda hard for me to get a good looking website up..

MDN is a good resource on all things Web, including CSS.

<https://developer.mozilla.org/en-US/docs/Web>

Richard Owlett

2024-07-16 01:35:02 UTC

Post by Lawrence D'Oliveiro

Post by candycanearter07

Post by Lawrence D'Oliveiro
CSS is essentially an indispensable part of HTML at this point. If it
saves you effort, why not use it?

It is kinda hard for me to get a good looking website up..

MDN is a good resource on all things Web, including CSS.
<https://developer.mozilla.org/en-US/docs/Web>

Appears to have useful content.
Needs at least a "Table of Contents".
An "Index" would likely be useful.

A problem of much tech documentation.
[Seen much of it in last half century. Been told I "write like an
engineer". Once by an English prof whose son was one.]

Lawrence D'Oliveiro

2024-07-16 02:46:08 UTC

Post by Richard Owlett

Post by Lawrence D'Oliveiro
MDN is a good resource on all things Web, including CSS.
<https://developer.mozilla.org/en-US/docs/Web>

Appears to have useful content.
Needs at least a "Table of Contents".

That page has the links to the various contents.

Richard Owlett

2024-07-16 11:17:46 UTC

Post by Lawrence D'Oliveiro

Post by Richard Owlett

Post by Lawrence D'Oliveiro
MDN is a good resource on all things Web, including CSS.
<https://developer.mozilla.org/en-US/docs/Web>

Appears to have useful content.
Needs at least a "Table of Contents".

That page has the links to the various contents.

Those do not make a "Table of Contents"!

See
https://researchmethod.net/table-of-contents/
especially
https://researchmethod.net/table-of-contents/#Importance_of_Table_of_Contents

Lawrence D'Oliveiro

2024-07-16 23:49:09 UTC

Post by Richard Owlett
Those do not make a "Table of Contents"!

It’s a table. It has the contents. Ergo, “table of contents”.

candycanearter07

2024-07-16 13:50:03 UTC

Post by Lawrence D'Oliveiro

Post by candycanearter07

Post by Lawrence D'Oliveiro
CSS is essentially an indispensable part of HTML at this point. If it
saves you effort, why not use it?

It is kinda hard for me to get a good looking website up..

MDN is a good resource on all things Web, including CSS.
<https://developer.mozilla.org/en-US/docs/Web>

alright..

--
user <candycane> is generated from /dev/urandom
