Best way to parse an HTML table into a CSV

Question

I've got to grab some product data off an existing website to put into a database. The data is all in HTML table format, the model numbers are unique, but each product can have any number of different attributes (so the tables I need to parse all have different columns and headings).

<table>
<tr>
<td>Model No.</td>
<td>Weight</td>
<td>Colour</td>
<td>Etc..</td>
</tr>
<tr>
<td>8572</td>
<td>12 Kg</td>
<td>Red</td>
<td>Blah..</td>
</tr>
<tr>
<td>7463</td>
<td>7 Kg</td>
<td>Blue</td>
<td>Blah..</td>
</tr>
<tr>
<td>8332</td>
<td>42 Kg</td>
<td>Yellow</td>
<td>Blah..</td>
</tr>
</table>

This is the CSV output format I'm looking for:

Model-No,Attribute-Name,Attribute-Value
8572,"Weight","12 Kg"
8572,"Colour","Red"
8572,"Etc","Blah.."
7463,"Weight","7 Kg"
7463,"Colour","Blue"
7463,"Etc","Blah.."
8332,"Weight","42 Kg"
8332,"Colour","Yellow"
8332,"Etc","Blah.."

As the tables all appear to be valid XHTML, I'll probably load each one into an XmlDocument, but does anyone have suggestions for a better way of accomplishing this? Thanks.

Answer 1:

I can think of 3 ways to do this:

HTML Agility Pack: load the HTML, loop through the elements, and write your CSV.
Use a regex to parse the table.
If your HTML is XHTML (valid XML), you can write an XSLT template to create the CSV automatically. This is the neatest option, but not the easiest one.

Answer 2:

You can always go with LINQ to XML, assuming you are on at least .NET 3.5.
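A minimal sketch of the LINQ to XML route, for illustration only. It assumes each table is a well-formed XML fragment with no XML namespace and no HTML-only entities such as &nbsp;, that the first row holds the headings, and that the first column is the model number. The names TableToCsv, ToCsvLines and Quote are made up for this example. Headings are copied verbatim, so "Etc.." would come out as "Etc.." rather than the "Etc" shown in the question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class TableToCsv
{
    // Wraps a value in double quotes, doubling any embedded quotes.
    private static string Quote(string value)
    {
        return "\"" + value.Replace("\"", "\"\"") + "\"";
    }

    // Turns one well-formed <table> into Model-No,Attribute-Name,Attribute-Value lines.
    public static IEnumerable<string> ToCsvLines(string xhtmlTable)
    {
        XElement table = XElement.Parse(xhtmlTable);

        // Each <tr> becomes a list of trimmed cell texts; rows without <td> cells are skipped.
        List<List<string>> rows = table.Descendants("tr")
            .Select(tr => tr.Elements("td").Select(td => td.Value.Trim()).ToList())
            .Where(cells => cells.Count > 0)
            .ToList();

        yield return "Model-No,Attribute-Name,Attribute-Value";
        if (rows.Count == 0) yield break;

        // Assumption: the first row holds the column headings.
        List<string> headers = rows[0];
        foreach (List<string> cells in rows.Skip(1))
        {
            // Assumption: the first cell is the model number (written unquoted, as in the question).
            string model = cells[0];
            // Stop at the shorter of the two rows so ragged tables do not throw.
            int count = Math.Min(headers.Count, cells.Count);
            for (int i = 1; i < count; i++)
            {
                yield return model + "," + Quote(headers[i]) + "," + Quote(cells[i]);
            }
        }
    }
}
```

Because the headings are read from the table itself, the same method should cope with tables that have different columns; the result can be written out with something like File.WriteAllLines.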

Answer 3:

HtmlAgilityPack is amazing for scraping data off HTML web pages. Use it to scrape the tables into some sort of intermediate object, then build the CSV file from that object.
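One way this could look, as a sketch rather than a definitive implementation. It assumes the HtmlAgilityPack NuGet package is referenced, that the HTML passed in contains a single table whose first row holds the headings, and that the first column is the model number. ProductAttribute, TableScraper, Scrape and ToCsv are hypothetical names chosen for this example.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical intermediate type: one row of the target CSV.
public sealed class ProductAttribute
{
    public string ModelNo { get; set; }
    public string Name { get; set; }
    public string Value { get; set; }
}

public static class TableScraper
{
    // Reads every table row and flattens it into (model, attribute, value) items.
    public static List<ProductAttribute> Scrape(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var result = new List<ProductAttribute>();
        // SelectNodes returns null when nothing matches.
        HtmlNodeCollection rows = doc.DocumentNode.SelectNodes("//table//tr");
        if (rows == null) return result;

        List<string> headers = null;
        foreach (HtmlNode row in rows)
        {
            // Decode entities such as &amp; and trim whitespace in each cell.
            List<string> cells = row.Elements("td")
                .Select(td => HtmlEntity.DeEntitize(td.InnerText).Trim())
                .ToList();
            if (cells.Count == 0) continue;

            // Assumption: the first non-empty row holds the headings.
            if (headers == null) { headers = cells; continue; }

            for (int i = 1; i < cells.Count && i < headers.Count; i++)
            {
                result.Add(new ProductAttribute
                {
                    ModelNo = cells[0],
                    Name = headers[i],
                    Value = cells[i]
                });
            }
        }
        return result;
    }

    // Formats the intermediate objects as CSV lines, quoting name and value.
    public static IEnumerable<string> ToCsv(IEnumerable<ProductAttribute> items)
    {
        yield return "Model-No,Attribute-Name,Attribute-Value";
        foreach (ProductAttribute a in items)
        {
            yield return a.ModelNo + "," + Quote(a.Name) + "," + Quote(a.Value);
        }
    }

    private static string Quote(string value)
    {
        return "\"" + value.Replace("\"", "\"\"") + "\"";
    }
}
```

Keeping the scraped data in a list of objects means the same items could also be inserted into the database directly, with the CSV step being optional.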

Answer 4:

In addition to the HtmlAgilityPack that Khaled Nassar mentioned, you can do it with jQuery: iterate over the rows with $('tr').each(...), assign the 1st, 2nd and 3rd child of each row to a product object, and send that object to a service or handler that creates the CSV from it.

Answer 5:

There is a very easy way (albeit an inelegant one) to accomplish this. If it's just a one-off, open the .htm/.html file containing the table in Excel and save the sheet as a .csv file (any data outside the table can easily be removed in Excel).

If you will be repeating this task, you can use the Microsoft.Office.Interop.Excel namespace in C# (or VB.NET) to automate it in a few lines, like so:

using Microsoft.Office.Interop.Excel;

...

// Start Excel hidden and suppress prompts so it can run unattended.
Application app = new Application();
app.ScreenUpdating = false;
app.DisplayAlerts = false;
app.AskToUpdateLinks = false;
app.Visible = false;

// Open the HTML file as a workbook (UpdateLinks = false, ReadOnly = false).
Workbook workbook = app.Workbooks.Open(fileName + ".html", false, false,
               Type.Missing, Type.Missing, Type.Missing, Type.Missing,
               Type.Missing, Type.Missing,
               Type.Missing, Type.Missing, Type.Missing, Type.Missing,
               Type.Missing, Type.Missing);

// Save the sheet in CSV format.
workbook.SaveAs(fileName + ".csv", Microsoft.Office.Interop.Excel.XlFileFormat.xlCSV);

// Close the workbook without saving further changes, then shut Excel down.
workbook.Close(false, Type.Missing, Type.Missing);
workbook = null;
app.Quit();
app = null;

...

If necessary, it should be easy in this case to strip the non-table content out of the HTML file first, using a regex on the table tags. In Visual Studio 2005 and up, right-click References in your project and you should find Microsoft.Office.Interop.Excel under the .NET tab.

Source: https://stackoverflow.com/questions/6356424/best-way-to-parse-an-html-table-into-a-csv

Tags

html-parsing