Open Excel, Parse Data?

南笙酒味 提交于 2019-12-02 13:59:18
jacoblambert

I cobbled together a prototype based on Jon Skeet's recommendation to use Interop, and an answer by juliusz on How do I import from Excel to a DataSet using Microsoft.Office.Interop.Excel?

However, there are two things I really don't like about this: First, it seems I have to type in all the column names. Second, it loads the entire spreadsheet into memory rather than reading as a stream (takes a minute even for relatively small spreadsheets.

private static void Main(string[] args)
{
    var dataT = Import(@"C:\Users\jlambert\Desktop\dSmall_encrypted.xlsx");

    var data = dataT.AsEnumerable();
...
}
public static System.Data.DataTable Import(String path)
{
    var app = new Application();
    Workbook workBook = app.Workbooks.Open(path, 0, true, 5, "", "", true, XlPlatform.xlWindows, "\t", false, false, 0, true, 1, 0);

    Worksheet workSheet = (Worksheet)workBook.ActiveSheet;

    int index = 0;
    object rowIndex = 2;

    System.Data.DataTable dt = new System.Data.DataTable();
    dt.Columns.Add("Facility_code");                    
    dt.Columns.Add("MRN");
    dt.Columns.Add("first_name");
    dt.Columns.Add("middle_name");
    dt.Columns.Add("last_name");
    dt.Columns.Add("address_line_1");
    dt.Columns.Add("address_line_2");
    dt.Columns.Add("city");
    dt.Columns.Add("state");
    dt.Columns.Add("zip");
    dt.Columns.Add("date_of_birth");
    dt.Columns.Add("gender");
    dt.Columns.Add("ssn");
    dt.Columns.Add("home_phone");
    dt.Columns.Add("work_phone");
    dt.Columns.Add("cell_phone");
    dt.Columns.Add("PCP");
    dt.Columns.Add("Practice Location");

    DataRow row;

    while (((Range)workSheet.Cells[rowIndex, 1]).Value2 != null)
    {
        rowIndex = 2 + index;
        row = dt.NewRow();
        row[0] = Convert.ToString(((Range)workSheet.Cells[rowIndex, 1]).Value2);
        row[1] = Convert.ToString(((Range)workSheet.Cells[rowIndex, 2]).Value2);
        row[2] = Convert.ToString(((Range)workSheet.Cells[rowIndex, 3]).Value2);
        row[3] = Convert.ToString(((Range)workSheet.Cells[rowIndex, 4]).Value2);
        row[4] = Convert.ToString(((Range)workSheet.Cells[rowIndex, 5]).Value2);
        row[5] = Convert.ToString(((Range)workSheet.Cells[rowIndex, 6]).Value2);
        row[6] = Convert.ToString(((Range)workSheet.Cells[rowIndex, 7]).Value2);
        row[7] = Convert.ToString(((Range)workSheet.Cells[rowIndex, 8]).Value2);
        row[8] = Convert.ToString(((Range)workSheet.Cells[rowIndex, 9]).Value2);
        row[9] = Convert.ToString(((Range)workSheet.Cells[rowIndex, 10]).Value2);
        row[10] = Convert.ToString(((Range)workSheet.Cells[rowIndex, 11]).Value2);
        row[11] = Convert.ToString(((Range)workSheet.Cells[rowIndex, 12]).Value2);
        row[12] = Convert.ToString(((Range)workSheet.Cells[rowIndex, 13]).Value2);
        row[13] = Convert.ToString(((Range)workSheet.Cells[rowIndex, 14]).Value2);
        row[14] = Convert.ToString(((Range)workSheet.Cells[rowIndex, 15]).Value2);
        row[15] = Convert.ToString(((Range)workSheet.Cells[rowIndex, 16]).Value2);
        row[16] = Convert.ToString(((Range)workSheet.Cells[rowIndex, 17]).Value2);
        row[17] = Convert.ToString(((Range)workSheet.Cells[rowIndex, 18]).Value2);
        index++;
        dt.Rows.Add(row);
    }
    app.Workbooks.Close();
    return dt;
}

This is lightening fast; I can't tell I'm not doing text manipulations with ReadAllLines and Regex. (Removed the text manipulation details).

How I got it to work is explained in the comments to the original question.

Not sure why it's so much faster than Interop.Excel. Inefficient coding? More efficient API?

Any insight is appreciated!

        var path = string.Format(@"C:\Users\jlambert\Desktop\encryptedSSNs.xlsx");
        var connStr = "Provider=Microsoft.ACE.OLEDB.12.0;Data Source=" + path + ";Extended Properties=Excel 12.0;";


        var adapter = new OleDbDataAdapter("SELECT * FROM [sheetName$]", connStr);
        var ds = new DataSet();

        adapter.Fill(ds, "anyNameHere");

        var data = ds.Tables["anyNameHere"].AsEnumerable();

        var query = data.Where(x => x.Field<string>("MRN") != string.Empty).Select(x =>
            new 
            {
                mrn = x.Field<string>("MRN"),
                ssn = x.Field<string>("ssn"),
            });

        foreach (var q in query)
        {
            Console.WriteLine(q);    
        }
        Console.ReadLine();
标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!