How to edit 1st instance of text in multiple htm files using batch command?

Submitted by 故事扮演 on 2019-12-02 10:34:17

The safest solution (albeit perhaps the slowest and most complicated) would be to parse your HTML files as HTML and remove the first paragraph from the DOM. This frees you from depending on the HTML source being formatted in any consistent way: comments are properly skipped, line breaks are handled correctly, and life is all sunshine and daisies. Parsing the HTML DOM can be done using an InternetExplorer.Application COM object. Here's a batch / JScript hybrid example:

@if (@CodeSection == @Batch) @then

@echo off
setlocal

for %%I in (*.html) do (
    cscript /nologo /e:JScript "%~f0" "%%~fI"
)

rem // end main runtime
goto :EOF

@end
// end batch / begin JScript chimera

WSH.Echo(WSH.Arguments(0));

var fso = WSH.CreateObject('scripting.filesystemobject'),
    IE = WSH.CreateObject('InternetExplorer.Application'),
    htmlfile = fso.GetAbsolutePathName(WSH.Arguments(0));

IE.Visible = 0;
IE.Navigate('file:///' + htmlfile.replace(/\\/g, '/'));
while (IE.Busy || IE.ReadyState != 4) WSH.Sleep(25);

var p = IE.document.getElementsByTagName('p');

if (p && p[0]) {

    /* If you want to remove the surrounding <p></p> only
    while keeping the paragraph's inner content, uncomment
    the following line: */

    // while (p[0].hasChildNodes()) p[0].parentNode.insertBefore(p[0].firstChild, p[0]);

    p[0].parentNode.removeChild(p[0]);
    htmlfile = fso.CreateTextFile(htmlfile, 1);
    htmlfile.Write('<!DOCTYPE html>\n'
        + '<html>\n'
        + IE.document.documentElement.innerHTML
        + '\n</html>');
    htmlfile.Close();
}

IE.Quit();
try { while (IE && IE.Busy) WSH.Sleep(25); }
catch(e) {}

And because you're working with the DOM, additional tweaks are made easier. To remove the first <p> element within each <div> element while keeping its contents (just as a wild example, not that anyone would ever want this), navigate the DOM as you would in browser-based JavaScript.

@if (@CodeSection == @Batch) @then

@echo off
setlocal

for %%I in ("*.htm") do (
    echo Batch section: "%%~fI"
    cscript /nologo /e:JScript "%~f0" "%%~fI"
)

rem // end main runtime
goto :EOF

@end
// end batch / begin JScript chimera

WSH.Echo('JScript section: "' + WSH.Arguments(0) + '"');

var fso = WSH.CreateObject('scripting.filesystemobject'),
    IE = WSH.CreateObject('InternetExplorer.Application'),
    htmlfile = fso.GetAbsolutePathName(WSH.Arguments(0)),
    changed;

IE.Visible = 0;
IE.Navigate('file:///' + htmlfile.replace(/\\/g, '/'));
while (IE.Busy || IE.ReadyState != 4) WSH.Sleep(25);

for (var d = IE.document.getElementsByTagName('div'), i = 0; i < d.length; i++) {

    var p = d[i].getElementsByTagName('p');
    if (p && p[0]) {

        // move contents of p node up to parent
        while (p[0].hasChildNodes()) p[0].parentNode.insertBefore(p[0].firstChild, p[0]);

        // delete now empty p node
        p[0].parentNode.removeChild(p[0]);
        changed = true;
    }
}

if (changed) {
    htmlfile = fso.CreateTextFile(htmlfile, 1);
    htmlfile.Write('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">\n'
        + '<HTML xmlns:t= "urn:schemas-microsoft-com:time" xmlns:control>\n'
        + IE.document.documentElement.innerHTML
        + '\n</HTML>');
    htmlfile.Close();
}

IE.Quit();
try { while (IE && IE.Busy) WSH.Sleep(25); }
catch(e) {}
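
The "move the contents up, then delete the empty node" step in the script above can be tried without Internet Explorer. What follows is only a minimal sketch in plain JavaScript: the `{ tag, children }` objects and the `unwrapFirst` function are hypothetical stand-ins for real DOM nodes, not part of any COM object.

```javascript
// Hypothetical stand-in for DOM nodes: { tag: 'p', children: [...] }.
// Plain strings play the role of text nodes.
function unwrapFirst(node, tag) {
    for (var i = 0; i < node.children.length; i++) {
        var child = node.children[i];
        if (typeof child === 'string') continue;
        if (child.tag === tag) {
            // Replace the matching element with its own children,
            // which is what the insertBefore / removeChild pair achieves.
            Array.prototype.splice.apply(node.children, [i, 1].concat(child.children));
            return true;
        }
        // Descend before moving on, so the search follows document order.
        if (unwrapFirst(child, tag)) return true;
    }
    return false;
}

var div = { tag: 'div', children: [
    { tag: 'p', children: ['Hello ', { tag: 'b', children: ['world'] }] },
    { tag: 'p', children: ['second'] }
] };
unwrapFirst(div, 'p');
// div.children is now: 'Hello ', the <b> node, and the second <p> node
```

Only the first matching element is unwrapped; the second paragraph stays wrapped, as in the script.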

The solution you were probably expecting, a pure batch solution, would involve a bunch of for loops. This example will strip the entire line(s) from the first <p> to the first </p>.

I'm sure npocmaka, MC ND, Aacini, jeb or dbenham can accomplish this with half the code and ten times the efficiency. *shrug*

This is the middle-of-the-road solution, offering more tolerance for line breaks within your <p> tag than the PowerShell regexp replacement, but not quite as safe as the InternetExplorer.Application COM object JScript hybrid.

@echo off
setlocal

for %%I in (*.html) do (

    set p_on_line=

    rem // get line number of first <p> tag
    for /f "tokens=1 delims=:" %%n in (
        'findstr /i /n "<p[^ar]" "%%~fI"'
    ) do if not defined p_on_line set "p_on_line=%%n"

    if defined p_on_line (

        rem // process file line-by-line
        setlocal enabledelayedexpansion
        for /f "delims=" %%L in ('findstr /n "^" "%%~fI"') do (
            call :split num line "%%L"

            rem // If <p> has not yet been reached, copy line to new file
            if !num! lss !p_on_line! (
                >>"%%~dpnI.new" echo(!line!
            ) else (
                rem // If </p> has been reached, resume writing.
                if not "!line!"=="!line:</p>=!" set p_on_line=2147483647
            )
        )
        endlocal
        if exist "%%~dpnI.new" move /y "%%~dpnI.new" "%%~fI" >NUL
    )
)

goto :EOF

:split <num_var> <line_var> <string>
setlocal disabledelayedexpansion
set "line=%~3"
for /f "tokens=1 delims=:" %%I in ("%~3") do set "num=%%I"
set "line=%line:*:=%"
endlocal & set "%~1=%num%" & set "%~2=%line%"
goto :EOF
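
For comparison, the line-based logic of the batch script above can be sketched in JavaScript, which makes it easier to test against sample input. This is an illustration only, not a replacement for the script; the function name `dropFirstParagraphLines` is made up, and the pattern `<p[^ar]` is borrowed from the findstr call so that tags such as <pre> and <param> are not mistaken for a paragraph.

```javascript
// Drop every line from the first one containing a <p> tag
// through the first later line containing </p> (inclusive).
function dropFirstParagraphLines(text) {
    var lines = text.split(/\r?\n/), out = [], state = 'before';
    for (var i = 0; i < lines.length; i++) {
        var line = lines[i];
        if (state === 'before' && /<p[^ar]/i.test(line)) state = 'inside';
        if (state === 'inside') {
            // The closing line is dropped too; writing resumes on the next line.
            if (/<\/p>/i.test(line)) state = 'after';
            continue;
        }
        out.push(line);
    }
    return out.join('\n');
}

var sample = '<body>\n<p class="x">\nintro\n</p>\n<p>keep</p>\n</body>';
dropFirstParagraphLines(sample);
// → '<body>\n<p>keep</p>\n</body>'
```

As with the batch version, anything else that shares a line with the first <p> or its </p> is lost along with the paragraph.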

@ECHO Off
SETLOCAL
SET "sourcedir=U:\sourcedir"
SET "destdir=U:\destdir"
PUSHD "%sourcedir%"
FOR /f "delims=" %%f IN ('dir /b /a-d "q28443084*" ') DO ((
 SET "zap=<P>"
 FOR /f "usebackqdelims=" %%a IN ("%%f") DO (
  IF DEFINED zap (
   SET "line=%%a"
   CALL :process
   IF DEFINED keep (ECHO(%%a) ELSE (iF DEFINED line CALL ECHO(%%line%%)
  ) ELSE (ECHO(%%a)
 )
 )>"%destdir%\%%f"
)
popd

GOTO :EOF

:process
SET "keep="
CALL SET "line2=%%line:%zap%=%%"
IF "%line%" equ "%line2%" SET "keep=y"&GOTO :EOF
SET "line=%line2%"
IF "%zap%"=="</P>" SET "zap="&GOTO :EOF 
SET "zap=</P>"
IF NOT DEFINED line GOTO :EOF 
SET "line=%line2:</P>=%"
IF "%line%" neq "%line2%" SET "zap="
GOTO :eof

This may work - it will suppress empty lines.

I chose to process files matching the mask q28443084* in directory u:\sourcedir to matching filenames in u:\destdir - you would need to change these settings to suit.

The process revolves around the setting of zap, which may be set to either <P>, </P> or nothing. The incoming line is examined, and is either kept as-is if it does not contain zap, or output in modified form with zap adjusted to the next value. If zap is nothing, the input is simply reproduced to the output.
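
The zap logic described above amounts to a three-state machine. Here is a minimal JavaScript sketch of it, meant for experimenting with sample input rather than as a port of the batch file; the function name and the state names ('open', 'close', 'done') are made up for illustration. Like the batch code, it strips only the literal <P> and </P> tags (in any letter case), keeps the paragraph text, and suppresses empty lines.

```javascript
function stripFirstParagraphTags(text) {
    var OPEN = /<p>/gi, CLOSE = /<\/p>/gi;
    var state = 'open';   // 'open': looking for <P>, 'close': looking for </P>, 'done': copy through
    var lines = text.split(/\r?\n/), out = [];
    for (var i = 0; i < lines.length; i++) {
        var line = lines[i];
        if (state === 'open') {
            var withoutOpen = line.replace(OPEN, '');
            if (withoutOpen !== line) { line = withoutOpen; state = 'close'; }
        }
        // Not an else-branch: <P> and </P> may sit on the same line.
        if (state === 'close') {
            var withoutClose = line.replace(CLOSE, '');
            if (withoutClose !== line) { line = withoutClose; state = 'done'; }
        }
        if (line !== '') out.push(line);   // empty lines are suppressed
    }
    return out.join('\n');
}

stripFirstParagraphTags('<html>\n<P>\nHello\n</P>\n<P>Second</P>\n</html>');
// → '<html>\nHello\n<P>Second</P>\n</html>'
```

Note that a tag carrying attributes, such as <P class="x">, is not recognised by this literal match.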

rojo

The shortest solution would be to use a PowerShell one-liner.

powershell -command "gci '*.html' | %{ ([regex]'<p\W.*?</p>').replace([IO.File]::ReadAllText($_),'',1) | sc $_ }"

Please note that this will only work if there are no line breaks within the first paragraph. If there's a line break between <p> and </p> this will keep searching until it finds a paragraph that doesn't have a line break. You might be better off trying to fix the vendor's broken CSS than this hackish workaround.

Anyway, the command above roughly translates thusly:

  • In the current directory, get child items matching *.html
  • For each matching html file (the % is an alias for foreach-object):

    • Create a regex object matching from <p to shining </p>
    • Call that regex object's replace method with the following params:

      • use the HTML file contents as the haystack,
      • replace the needle with nothing,
      • and do this 1 time.
    • Set the content of the HTML file to be the result.

I used [IO.File]::ReadAllText($_) rather than gc $_ to preserve line breaks. Using get-content with [regex].replace mashes everything together into one line. I used a [regex] object rather than the simpler -replace operator because -replace is global.
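
The line-break caveat is easy to reproduce outside PowerShell. The sketch below uses JavaScript, whose regex dot also refuses to match a line feed by default (its exact rules differ slightly from .NET's, so treat this as an analogy). Both function names are hypothetical, and the second function, which swaps `.` for `[\s\S]` so that a match may span lines, is an assumption about how one might adapt the pattern rather than something the one-liner does.

```javascript
// Without the 'g' flag, replace() touches only the first match,
// comparable to passing a count of 1 to [regex].replace.
function removeFirstParagraph(html) {
    return html.replace(/<p\W.*?<\/p>/, '');
}

// Variant that tolerates line breaks inside the paragraph.
function removeFirstParagraphMultiline(html) {
    return html.replace(/<p\W[\s\S]*?<\/p>/, '');
}

removeFirstParagraph('<p>a</p><p>b</p>');
// → '<p>b</p>'

removeFirstParagraph('<p>a\nb</p>\n<p>c</p>');
// → '<p>a\nb</p>\n'  (the first paragraph is skipped, the second removed)

removeFirstParagraphMultiline('<p>a\nb</p>\n<p>c</p>');
// → '\n<p>c</p>'
```

In .NET the comparable adjustment would be the Singleline regex option, which lets the dot match line feeds.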

Here's a similar solution to the HTML DOM answer. If your HTML is valid, you could try to parse it as XML. The advantage here is that, where the InternetExplorer.Application COM object loads an entire fully-bloated instance of Internet Explorer for each page load, you're instead loading only a dll (msxml3.dll). This should hopefully handle multiple files more efficiently. The downside is that the XML parser is finicky about the validity of your tag structure. If, for example, you have an unordered list where the list items are not closed:

<ul>
    <li>Item 1
    <li>Item 2
</ul>

... a web browser would understand that just fine, but the XML parser will probably error. Anyway, it's worth a shot. I just tested this on a directory of 500 identical HTML files, and it worked through them in less than a minute.

@if (@CodeSection == @Batch) @then

@echo off
setlocal

for %%I in ("*.htm") do (
    cscript /nologo /e:JScript "%~f0" "%%~fI"
)

rem // end main runtime
goto :EOF

@end
// end batch / begin JScript chimera

WSH.StdOut.Write('Checking ' + WSH.Arguments(0) + '... ');

var fso = WSH.CreateObject('scripting.filesystemobject'),
    DOM = WSH.CreateObject('Microsoft.XMLDOM'),
    htmlfile = fso.OpenTextFile(WSH.Arguments(0), 1),
    html = htmlfile.ReadAll().split(/<\/head\b.*?>/i),  
    head = html[0] + '</head>',
    body = html[1].replace(/<\/html\b.*?>/i,''),
    changed;

htmlfile.Close();

// attempt to massage body string into valid XHTML
var self_closing_tags = ['area','base','br','col',
    'command','comment','embed','hr','img','input',
    'keygen','link','meta','param','source','track','wbr'];

body = body.replace(/<\/?\w+/g, function(m) { return m.toLowerCase(); }).replace(
    RegExp([    // should match <br>
        '<(',
            '(' + self_closing_tags.join('|') + ')\\b',    // \b stops <colgroup> matching as <col>
            '([^>]+[^\/])?',    // for tags with properties, tag is unclosed
        ')>'
    ].join(''), 'ig'), "<$1 />"
);

// set async before loading so the setting applies to this load
DOM.async = false;
DOM.loadXML(body);

if (DOM.parseError.errorCode) {
   WSH.Echo(DOM.parseError.reason);
   WSH.Quit(0);
}

for (var d = DOM.documentElement.getElementsByTagName('div'), i = 0; i < d.length; i++) {

    var p = d[i].getElementsByTagName('p');
    if (p && p[0]) {

        // move contents of p node up to parent
        while (p[0].hasChildNodes()) p[0].parentNode.insertBefore(p[0].firstChild, p[0]);

        // delete now empty p node
        p[0].parentNode.removeChild(p[0]);
        changed = true;
    }
}

html = head + DOM.documentElement.xml + '</html>';

if (changed) {
    htmlfile = fso.CreateTextFile(WSH.Arguments(0), 1);
    htmlfile.Write(html);
    htmlfile.Close();
    WSH.Echo('Fixed!');
}
else WSH.Echo('Nothing to change.');
rojo
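
The tag-massaging step is the part most likely to need tuning for a given set of files, so it can help to try it in isolation. This is a minimal sketch in plain JavaScript, assuming a shortened list of void elements; `closeVoidTags` and `VOID_TAGS` are names invented here. It uses a word boundary after the tag name so that <colgroup> is not treated as <col>.

```javascript
var VOID_TAGS = ['area', 'base', 'br', 'col', 'embed', 'hr', 'img',
    'input', 'link', 'meta', 'param', 'source', 'track', 'wbr'];

// Rewrite unclosed void elements such as <br> or <img src="..."> into
// self-closing form. Tags that already end in "/>" are left untouched.
function closeVoidTags(html) {
    var re = new RegExp(
        '<((' + VOID_TAGS.join('|') + ')\\b([^>]*[^>\\/])?)>', 'ig');
    return html.replace(re, '<$1 />');
}

closeVoidTags('<p>a<br>b<img src="x.png"></p>');
// → '<p>a<br />b<img src="x.png" /></p>'
```

This only addresses void elements. Unclosed <li> or <p> tags, as in the list example earlier, would still trip the XML parser.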

For posterity, I found another solution. O.P. was having problems with browser security and group policy restrictions preventing the InternetExplorer.Application COM object from behaving as expected, and the HTML he's fixing cannot reasonably be massaged into valid XML for the Microsoft.XMLDOM parser. But I'm optimistic that the htmlfile COM object won't suffer from these same infirmities.

As I emailed the O.P.:

Peppered around Google searches I found occasional references to a mysterious COM object called "htmlfile". It appears to be a way to build and interact with the HTML DOM without using the IE engine. I can't find any documentation on it on MSDN, but I managed to scrape together enough methods and properties from trial and error to make the script work.

I've since discovered that there's more to the htmlfile COM object than meets the eye -- htmlfileObj.parentWindow.clipboardData for example (MSDN reference).

Anyway, I was most optimistic about this solution, but O.P. has stopped returning my emails. Perhaps it'll be useful to someone else though.

@if (@CodeSection == @Batch) @then

@echo off
setlocal

for %%I in ("*.htm") do cscript /nologo /e:JScript "%~f0" "%%~fI"

rem // end main runtime
goto :EOF

@end
// end batch / begin JScript chimera

WSH.StdOut.Write(WSH.Arguments(0) + ': ');

var fso = WSH.CreateObject('scripting.filesystemobject'),
    DOM = WSH.CreateObject('htmlfile'),
    htmlfile = fso.OpenTextFile(WSH.Arguments(0), 1),
    html = htmlfile.ReadAll(),
    head = html.split(/<body\b.*?>/i)[0],
    bodyTag = html.match(/<body\b.*?>/i)[0],
    changed;

DOM.write(html);
htmlfile.Close();

if (DOM.getElementsByName('p_tag_fixed').length) {
    WSH.Echo('fix already applied.');
    WSH.Quit(0);
}

for (var d = DOM.body.getElementsByTagName('div'), i = 0; i < d.length; i++) {

    var p = d[i].getElementsByTagName('p');
    if (p && p[0]) {

        // move contents of p node up to parent
        while (p[0].hasChildNodes()) p[0].parentNode.insertBefore(p[0].firstChild, p[0]);

        // delete now empty p node
        p[0].parentNode.removeChild(p[0]);

        changed = true;
    }
}

if (changed) {
    htmlfile = fso.CreateTextFile(WSH.Arguments(0), 1);
    htmlfile.Write(
        head
        + '<meta name="p_tag_fixed" />'
        + bodyTag
        + DOM.body.innerHTML
        + '</body></html>'
    );
    htmlfile.Close();
    WSH.Echo('Fixed!');
}
else WSH.Echo('unchanged.');
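
The p_tag_fixed meta tag is what makes the script safe to run twice over the same directory. A rough sketch of that guard on plain strings follows, assuming the marker is written exactly as in the script; `hasMarker` and `addMarker` are illustrative names, and the real script checks through the DOM rather than with a regular expression.

```javascript
var MARKER = '<meta name="p_tag_fixed" />';

// True if the file already carries the marker from an earlier run.
function hasMarker(html) {
    return /<meta\s+name="p_tag_fixed"/i.test(html);
}

// Insert the marker immediately before the <body> tag, as the script does.
function addMarker(html) {
    if (hasMarker(html)) return html;
    return html.replace(/<body\b.*?>/i, function (bodyTag) {
        return MARKER + bodyTag;
    });
}

var once = addMarker('<html><head></head><body class="x"><p>hi</p></body></html>');
var twice = addMarker(once);
// once === twice: a second pass leaves the file alone
```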