Load *.htm file saved as .xls (starting from row number 5) Using Power Query

别等时光非礼了梦想. Posted on 2019-12-06 18:17:25

For whatever reason, the HTML in the sample file has unmatched tags that the XML parser doesn't like. You can still get at the data with some work, though: load the file as text, then remove or fix any parts that the parser has trouble with.

Consider this M code:

let
    // Read the file line by line as plain text into a one-column table
    Source = Table.FromColumns({Lines.FromBinary(File.Contents("C:\Users\aolson\Downloads\example-html.xls\example-html.xls"))}),
    // Skip the first 60 lines and keep the next 22 (the <tbody> section of the sample file)
    #"Kept Range of Rows" = Table.Range(Source,60,22),
    // Join the kept lines into a single text value
    Column1 = Text.Combine(#"Kept Range of Rows"[Column1]),
    // Parse that text as XML
    #"Parsed XML" = Xml.Tables(Column1),
    Table = #"Parsed XML"{0}[Table],
    // Expand the nested tables produced by the parser
    #"Expanded td" = Table.ExpandTableColumn(Table, "td", {"i", "b", "span", "Element:Text"}, {"td.i", "td.b", "td.span", "td.Element:Text"}),
    #"Expanded td.span" = Table.ExpandTableColumn(#"Expanded td", "td.span", {"Element:Text", "Attribute:style"}, {"td.span.Element:Text", "td.span.Attribute:style"})
in
    #"Expanded td.span"

The steps here are roughly:

  1. Load the file as text.
  2. Select just the <tbody> section (in the sample file, 22 lines starting at offset 60).
  3. Concatenate those rows into a single text value.
  4. Parse that text as XML.
  5. Expand any tables that are found.
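
Step 2 relies on row offsets that are specific to the sample file. As a rough alternative, the <tbody> section could be located by searching the text for its opening and closing tags. This is only a sketch under assumptions: the inline Html value is a made-up stand-in for the file contents (which could come from Text.FromBinary(File.Contents(...))), the tags are assumed to be lowercase, and only one <tbody> is assumed to exist.

```powerquery
let
    // Hypothetical sample text standing in for the real file contents
    Html = "<html><table><thead><tr><th>A</th></tr></thead><tbody><tr><td>1</td></tr></tbody></table></html>",
    EndTag = "</tbody>",
    // Text.PositionOf returns -1 when the substring is not found
    StartPos = Text.PositionOf(Html, "<tbody"),
    EndPos = Text.PositionOf(Html, EndTag),
    // Keep everything from the opening tag through the end of the closing tag
    Body =
        if StartPos < 0 or EndPos < 0 then
            error "No <tbody> section found"
        else
            Text.Range(Html, StartPos, EndPos - StartPos + Text.Length(EndTag))
in
    Body
```

The resulting text plays the same role as Column1 in the query above and could be passed to Xml.Tables in the same way.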

When I initially did this, I noticed the <b> tag wasn't closed, so I added a </b> in my source file.
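
If editing the source file by hand isn't practical, a similar repair can be attempted inside the query. A minimal sketch, assuming the bold tags carry no attributes and the bold formatting itself isn't needed: strip every <b> and </b> from the text before parsing, so that nothing is left unmatched. The sample string is hypothetical and not taken from the actual file.

```powerquery
let
    // Hypothetical fragment with an unclosed <b> tag (not from the real file)
    Raw = "<tr><td><b>Total</td><td>42</td></tr>",
    // Remove opening and closing bold tags so no unmatched tag remains
    NoOpenTags = Text.Replace(Raw, "<b>", ""),
    Cleaned = Text.Replace(NoOpenTags, "</b>", "")
in
    // Cleaned no longer contains any <b> or </b> and could be handed to Xml.Tables
    Cleaned
```

In the query above, these replacements would sit between the Text.Combine step and the Xml.Tables step. Text.Replace is case-sensitive, so a file that uses <B> would need those variants handled as well.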

The results are a bit ugly, but I suspect that if your actual data files don't include much formatting or inconsistent table structure, you can get something along these lines working passably well, especially if you only have a single column to deal with.
