Question
I have a large XML file of 1 GB with the following structure:
<?xml version='1.0' encoding='windows-1252'?>
<ext:BookingExtraction>
<Booking><Code>2016Z00258</Code><Advertiser><Code>00123</Code<Name>LOUIS VUITTON</Name></Advertiser></Booking>
<Booking><Code>2016Z00259</Code><Advertiser><Code>00124</Code<Name>Adidas</Name></Advertiser></Booking>
</ext:BookingExtraction>
As the structure is really simple, my goal is to get the last 150 lines of the XML file, copy them into a new file, and add the opening tags at the top so that the result is well-formed XML.
The algorithm works fine, but lines with more than 65,536 characters get split across several lines. I read that DOS limits the number of characters per line to 65,536, which is why a carriage return is inserted after every 65,536 characters.
The result is that the final XML is not well formed because of the carriage return in the middle of a line. For instance:
<ext:BookingExtraction>
<Booking><Code>2016Z00258</Code><Advertiser><Code>00123</Code><Name>LOUIS VUIT
TON</Name></Advertiser></Booking>
</ext:BookingExtraction>
I tried to remove the carriage return characters but it did not work. Do you have any idea how I could fix this?
@echo off
setLocal EnableDelayedExpansion
::Get XML file
for /r %%a in (extractedBookings_BookingWithoutUnitsContent_PRD_*.xml) do (
rem echo directory is "%%~dpa" and file name is "%%~nxa"
set fileName="%%~nxa"
)
::Get the last 150 lines of the file
echo File path: "%fileName%"
for /f %%i in ('find /v /c "" ^< "%fileName%"') do set /a lines=%%i
echo nb lines: "%lines%"
set /a startLine=%lines% - 150
echo Start line "%startLine%"
more /e +%startLine% "%fileName%" > extractedBookings_BookingWithoutUnitsContent_PRD.xml
::Add the opening tags to the new file
echo ^<?xml version='1.0' encoding='windows-1252'?^> > newFile.xml
echo ^<ext:BookingExtraction^> >> newFile.xml
::Get the final file
type extractedBookings_BookingWithoutUnitsContent_PRD.xml >> newFile.xml
type newFile.xml > extractedBookings_BookingWithoutUnitsContent_PRD.xml
Thank you in advance
Answer 1:
Your question is confusing; the phrase "DOS limits the number of characters per line to 65,536" is imprecise. When the output of the more command is redirected to a disk file, more waits for a keypress after 65,536 lines, and that character is inserted into the output. Also, the maximum line length the FIND command handles is 1,070 characters (according to this site), so I guess your file has shorter lines. You just need a method that can cleanly output more than 64K lines.
The solution below is basically your own code, but instead of your more +%startLine% command it uses a combination of the set /P command to skip the first lines and the findstr command to show the rest.
@echo off
setLocal EnableDelayedExpansion
::Get XML file
for /r %%a in (extractedBookings_BookingWithoutUnitsContent_PRD_*.xml) do (
rem echo directory is "%%~dpa" and file name is "%%~nxa"
set fileName="%%~nxa"
)
::Get the last 150 lines of the file
echo File path: "%fileName%"
for /f %%i in ('find /v /c "" ^< "%fileName%"') do set /a lines=%%i
echo nb lines: "%lines%"
set /a startLine=%lines% - 150
echo Start line "%startLine%"
REM Use a code block to read from redirected input file (and write to output file)
< "%fileName%" (
rem adding opening tag to the new file
echo ^<?xml version='1.0' encoding='windows-1252'?^>
echo ^<ext:BookingExtraction^>
REM Skip the first total-150 lines
for /L %%i in (1,1,%startLine%) do set /P "="
REM Copy the rest
findstr "^"
) > extractedBookings_BookingWithoutUnitsContent_PRD.xml
This method may still fail if an input line is longer than 1023 characters, because that is the limit of the set /P command.
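If PowerShell 3.0 or later is available, a shorter route avoids the set /P limit altogether (a sketch added here for comparison, not part of the original answer): Get-Content -Tail reads the requested number of lines backwards from the end of the file and has no per-line length cap.
@echo off
setlocal
::Locate the extraction file the same way as above
for /r %%a in (extractedBookings_BookingWithoutUnitsContent_PRD_*.xml) do set "fileName=%%~fa"
::Write the XML declaration and the opening tag first
> newFile.xml echo ^<?xml version='1.0' encoding='windows-1252'?^>
>> newFile.xml echo ^<ext:BookingExtraction^>
::Append the last 150 lines; -Encoding Default keeps the ANSI (windows-1252) output
::that the declaration above promises (Windows PowerShell; PowerShell Core treats Default as UTF-8)
powershell -noprofile -command "Get-Content -LiteralPath '%fileName%' -Tail 150 | Add-Content -LiteralPath 'newFile.xml' -Encoding Default"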
Answer 2:
As I commented earlier, 'tis better to parse XML as a hierarchical structure, rather than as predictably-formatted flat text. If that flat text is beautified, uglified, minified, whatever, a flat text scraper will fail.
Your example XML is still a little ambiguous, so I'm assuming you've got a single <ext:BookingExtraction> tag with a ton of <Booking> child nodes you wish to whittle down to the last 150.
Before your example XML can be parsed, though (besides fixing the missing > in </Code>), we need to massage it slightly by defining the namespace to which ext belongs.
Before:
<ext:BookingExtraction>
After:
<ext:BookingExtraction xmlns:ext="http://localhost">
Although strictly speaking that's probably a bogus namespace, it's good enough to make the XML parseable. We can do this programmatically by reading the XML into a variable and performing a regex replace. After that, it's just a matter of removing child nodes within a while loop until you reach your 150-element goal.
Save this with a .bat extension, replace "test.xml" with the location of your XML file, and run it.
@if (@CodeSection == @Batch) @then
@echo off & setlocal
cscript /nologo /e:JScript "%~f0" "test.xml" "output.xml"
goto :EOF
@end // end Batch / begin JScript hybrid code
var args = { infile: WSH.Arguments(0), outfile: WSH.Arguments(1) },
fso = WSH.CreateObject('Scripting.FileSystemObject'),
file = fso.OpenTextFile(args.infile, 1),
xml = file.ReadAll(),
DOM = WSH.CreateObject('MSXML2.DOMDocument.6.0'),
ns = 'xmlns:ext="http://localhost"',
xpath = '/ext:BookingExtraction/Booking';
file.Close();
DOM.loadXML(xml.replace(
/<(ext:BookingExtraction)>/i,
function($0, $1) { return '<' + $1 + ' ' + ns + '>' }
));
if (DOM.parseError.errorCode) {
var e = DOM.parseError;
WSH.StdErr.WriteLine('Error in ' + args.infile + ' line ' + e.line + ' char '
+ e.linepos + ':\n' + e.reason + '\n' + e.srcText);
WSH.Quit(1);
}
DOM.setProperty('SelectionNamespaces', ns);
while (DOM.selectNodes(xpath).length > 150) {
    var node = DOM.selectSingleNode(xpath);
    node.parentNode.removeChild(node);
}
DOM.save(args.outfile);
... Or it might be a little easier just to strip out the ext: namespace and replace it later. Here's a batch + PowerShell hybrid script that demonstrates. It's not as fast as the batch + JScript hybrid, and it has a side effect of beautifying all tags whether you want them indented or not (see the note after the script for a way around that). But it does have the advantage of simplicity.
<# : batch portion
@echo off & setlocal
set "infile=test.xml"
set "outfile=out.xml"
powershell -noprofile "iex (${%~f0} | out-string)"
goto :EOF
: end batch / begin PowerShell hybrid #>
[xml]$xml = (gc $env:infile) -replace "ext:"
$xpath = "/BookingExtraction/Booking"
$deleted = 0
while ($xml.selectNodes($xpath).Count -gt 150) {
$node = $xml.selectSingleNode($xpath)
[void]$node.parentNode.removeChild($node)
$deleted++
}
write-host "Removed $deleted nodes" -f magenta
$xml.save($env:outfile)
(gc $env:outfile) -replace "BookingExtraction", "ext:BookingExtraction" | sc $env:outfile
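A note on that beautification side effect (an addition, not part of the original answer): .NET's XmlDocument can be told to keep the original formatting. The sketch below replaces the [xml] cast on the first line of the PowerShell portion above; everything else stays the same.
# Build the XmlDocument explicitly so PreserveWhitespace can be set before loading
$xml = new-object Xml.XmlDocument
$xml.PreserveWhitespace = $true
$xml.LoadXml(((gc $env:infile) -replace "ext:") -join "`r`n")
One caveat: with whitespace preserved, removing a node leaves its neighbouring line break behind, so the output may contain blank lines where the deleted bookings used to be.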
Edit: if dealing with large files (1GB+), maybe it would actually be better to trim the fat as flat text, rather than manipulating as structured object data. If you want the last 150 lines, I think it'd be more efficient to start at the bottom and work backwards, rather than starting at the top and skipping millions of lines. Opening the XML file with .NET methods will allow you to seek to the end of the file nearly instantly, then walk up. Try this batch + PowerShell script and see whether it works more efficiently for you:
<# : batch portion
@echo off & setlocal
set "infile=test.xml"
set "outfile=out.xml"
powershell -noprofile "iex (${%~f0} | out-string)"
goto :EOF
: end batch / begin PowerShell hybrid #>
$lines = 150
$found = 0
$reader = new-object IO.StreamReader((gi $env:infile).FullName)
$stream = $reader.BaseStream
$xml = $reader.ReadLine(), $reader.ReadLine()
$pos = $stream.Seek(0, [IO.SeekOrigin]::End)
while ($found -le $lines) {
$reader.DiscardBufferedData()
$stream.Position = --$pos
$char = $reader.Peek()
if ($char -eq -1) { break }
else { if ($char -eq 10) { $found++ } }
}
$reader.DiscardBufferedData()
$stream.Position = ++$pos
$xml += $reader.ReadToEnd()
$reader.Close()
$xml -join "`r`n" | out-file $env:outfile -encoding default
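Whichever variant you use, it may be worth checking the result before handing it on. Here is a quick sanity check (an addition, not part of the original answer; it reuses the ext:-stripping trick from the first PowerShell script, since the prefix has no namespace declaration in the output): cast the file to [xml] and let the parser complain if a stray line break ended up inside a tag.
<# : batch portion
@echo off & setlocal
set "outfile=out.xml"
powershell -noprofile "iex (${%~f0} | out-string)"
goto :EOF
: end batch / begin PowerShell hybrid #>
try {
    # The [xml] cast parses the whole document and throws if it is not well formed
    [xml]$check = (gc $env:outfile) -replace "ext:"
    write-host "$env:outfile is well formed; $(@($check.BookingExtraction.Booking).Count) bookings" -f magenta
} catch {
    write-host "Parse failed: $($_.Exception.Message)" -f red
}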
Source: https://stackoverflow.com/questions/35702905/batch-dos-copying-last-lines-of-a-file-limited-by-65-536-characters