How do I extract rows from a PDF file into a csv file?

清酒与你 2021-01-06 10:51

I want to get a list of all the colleges in USA from this PDF file and put it into a CSV file. I will then import the CSV file into SQL server (so that I can run queries eas

4 answers
  •  鱼传尺愫
    2021-01-06 11:32

    Another solution that needs little code to read the PDF: there is a Linux tool called pdftotext with a very useful -layout flag, as already mentioned on Ask Ubuntu:

    $ pdftotext -layout  
    

    It gave very promising results for the PDF file you provided. It's not a complete solution to your problem, but all that remains is to clean up the text output. This could be less time-consuming than other solutions.
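
If the conversion has to run from a script rather than by hand, something like the following Python wrapper might do. It is only a sketch: it assumes pdftotext is installed and on the PATH, the function names and the file name schools.pdf are made up for illustration, and the explicit `-enc UTF-8` option is my addition so the output encoding is predictable.

```python
import subprocess
from pathlib import Path


def build_pdftotext_command(pdf_path, txt_path):
    """Return the argument list for a layout-preserving pdftotext run."""
    # -layout keeps the physical column layout of each page;
    # -enc UTF-8 (an assumption, not in the original command) fixes the encoding.
    return ["pdftotext", "-layout", "-enc", "UTF-8", str(pdf_path), str(txt_path)]


def pdf_to_layout_text(pdf_path, txt_path):
    """Run pdftotext and return the extracted text.

    Raises subprocess.CalledProcessError if pdftotext exits with an error.
    """
    subprocess.run(build_pdftotext_command(pdf_path, txt_path), check=True)
    return Path(txt_path).read_text(encoding="utf-8")


# Hypothetical usage (file names are placeholders):
#     text = pdf_to_layout_text("schools.pdf", "test.txt")
```

Keeping the command construction in its own function makes it easy to check or log the exact arguments before anything is executed.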

    Here is a sample:

    $ head -30 test.txt
                                                                                                              Updated
                                         SEVP Certified Schools                                      September 16, 2015
    SCHOOL NAME                                     CAMPUS NAME                            F M CITY                     ST   CAMPUS ID
    "I Am" School Inc.                              "I Am" School Inc.                     Y N Mount Shasta             CA     41789
    424 Aviation                                    424 Aviation                           N Y Miami                    FL     103705
                                                                ‐ A ‐
    A F International School of Languages Inc.      A F International College              Y   N Los Angeles            CA      9538
    A F International School of Languages Inc.      A F International of Westlake          Y   N Westlake Village       CA     57589
                                                    Village
    A. T. Still University of Health Sciences       Kirksville Coll of Osteopathic         Y   N Kirksville         MO         3606
                                                    Medicine
    Aaron School                                    Aaron School ‐ 30th Street             Y   N   New York             NY    159091
    Aaron School                                    Aaron School                           Y   N   New York             NY    114558
    ABC Beauty Academy, INC.                        ABC Beauty Academy, INC.               N   Y   Flushing             NY    95879
    ABC Beauty Academy, LLC                         ABC Beauty Academy                     N   Y   Garland              TX    50677
    Abcott Institute                                Abcott Institute                       N   Y   Southfield           MI    197890
    Aberdeen Catholic School System                 Roncalli Primary                       Y   N   Aberdeen             SD    180510
    Aberdeen Catholic School System                 Roncalli                               Y   N   Aberdeen             SD    21405
    Aberdeen Catholic School System                 Roncalli Elementary                    Y   N   Aberdeen             SD    180511
    Aberdeen School District 6‐1                    Aberdeen Central High School           Y   N   Aberdeen             SD    36568
    Abiding Savior Lutheran School                  Abiding Savior Lutheran School         Y   N   Lake Forest          CA     9920
    Abilene Christian Schools                       Abilene Christian Schools              Y   N   Abilene              TX     8973
    Abilene Christian University                    Abilene Christian University           Y   N   Abilene              TX     7498
    Abington Friends School                         Abington Friends School                Y   N   Jenkintown           PA    20191
    Above It All, Inc                               Benchmark Flight /Hawaii Flight        N   Y   Kailua‐Kona          HI    24353
                                                    Academy
    Abraham Baldwin Agricultural College            Tifton Campus                          Y   N Tifton             GA         6931
    Abraham Joshua Heschel School                   Abraham Joshua Heschel School          Y   N New York           NY        106824
    
    ABT Jacqueline Kennedy Onassis School           ABT Jacqueline Kennedy Onassis         Y   Y New York               NY     52401
    

    So this turns your problem into one of transforming that text output into a CSV file that a database can read. Maybe you or someone else will prefer this way of doing it.
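
One possible way to do that transformation is sketched below in Python. It assumes every data row has the shape seen in the sample (school name, campus name, two Y/N flags, city, two-letter state, numeric campus ID, with at least two spaces between the text columns). The regular expression, the function name and the decision to collect non-matching lines are my own choices, not part of pdftotext. Lines that do not match, such as page headers, the "‐ A ‐" letter separators and wrapped campus names like "Village", are returned separately so they can be reviewed or merged by hand.

```python
import csv
import io
import re

HEADER = ["SCHOOL NAME", "CAMPUS NAME", "F", "M", "CITY", "ST", "CAMPUS ID"]
FIELDS = ("school", "campus", "f", "m", "city", "st", "campus_id")

# Text columns are separated by two or more spaces; the row must end
# with a two-letter state code followed by a numeric campus ID.
ROW_RE = re.compile(
    r"^\s*(?P<school>\S.*?)\s{2,}(?P<campus>\S.*?)\s{2,}"
    r"(?P<f>[YN])\s+(?P<m>[YN])\s+(?P<city>\S.*?)\s{2,}"
    r"(?P<st>[A-Z]{2})\s+(?P<campus_id>\d+)\s*$"
)


def layout_text_to_csv(text):
    """Convert `pdftotext -layout` output to CSV text.

    Returns (csv_text, skipped), where `skipped` lists the stripped
    non-blank lines that did not look like a data row.
    """
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(HEADER)
    skipped = []
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        match = ROW_RE.match(line)
        if match:
            writer.writerow(match.group(*FIELDS))
        else:
            skipped.append(line.strip())
    return out.getvalue(), skipped


# Small demo using two rows and one wrapped line from the sample above.
sample = "\n".join([
    "424 Aviation                                    424 Aviation                           N Y Miami                    FL     103705",
    "A F International School of Languages Inc.      A F International of Westlake          Y   N Westlake Village       CA     57589",
    "                                                Village",
])
csv_text, skipped = layout_text_to_csv(sample)
print(csv_text)
print("Lines needing manual review:", skipped)
```

The csv module takes care of quoting, so names containing commas or double quotes (for example ABC Beauty Academy, INC. or "I Am" School Inc.) stay in a single field. Rows whose columns run together with only one space between them will not match and end up in the skipped list, so that list is worth checking before importing into the database.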
