How to create a truncated permanent database from a larger file in SAS [duplicate]

Question

I'm trying to read a comma delimited .txt file (called 'file.txt' in the code below) into SAS in order to create a permanent database that includes only some of the variables and observations.

Here's a snippet of the .txt file for reference:

SUMLEV,REGION,DIVISION,STATE,NAME,POPESTIMATE2013,POPEST18PLUS2013,PCNT_POPEST18PLUS
10,0,0,0,United States,316128839,242542967,76.7
40,3,6,1,Alabama,4833722,3722241,77
40,4,9,2,Alaska,735132,547000,74.4
40,4,8,4,Arizona,6626624,5009810,75.6
40,3,7,5,Arkansas,2959373,2249507,76

My (abbreviated) code is as follows:

options nocenter nodate ls=72 ps=58;
filename foldr1 'C:\Users\redacted\Desktop\file.txt';
libname foldr2 'C:\Users\redacted\Desktop\Data';
libname foldr3 'C:\Users\redacted\Desktop\Formats';
options fmtsearch=(FMTfoldr.bf_fmts);

proc format library=foldr3.bf_fmts;
[redacted]
run;

data foldr2.file;
infile foldr1 DLM=',' firstobs=2 obs=52;
input STATE $ NAME $ REGION $ POPESTIMATE2013;
PERCENT=POPESTIMATE2013/316128839;
format REGION $regfmt.;
run;

proc print data=foldr2.file;
sum POPESTIMATE2013 PERCENT;
title 'Title';
run;

In my INPUT statement, I list the variables that I want to include in my new truncated database (STATE, NAME, REGION, etc.).

When I print my truncated database, I notice that none of my INPUT variables correspond to the variables of the same name in the original file. Instead, my variables print out like this:

STATE (1st var listed in INPUT) printed as SUMLEV (1st var listed in .txt file)
NAME (2nd var listed in INPUT) printed as REGION (2nd var listed in .txt file)
REGION (3rd var listed in INPUT) printed as DIVISION (3rd var listed in .txt file)
POPESTIMATE2013 (4th var listed in INPUT) printed as STATE (4th var listed in .txt file)

It seems that SAS is matching my INPUT variables based on order, not on name. So, because I list STATE first in my INPUT statement, SAS prints out the first variable of the original .txt file (i.e., the SUMLEV variable).

Any idea what's wrong with my code? Thanks for your help!

Answer 1:

Your current code is reading in the first 4 values from each line of the CSV file and assigning them to columns with the names you have listed.

The input statement lists the columns to read, in the order they appear on each line (and, optionally, where to read them from); it does not search the input file for columns by name.
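
As a minimal sketch of this positional behaviour, the step below reads two rows of the sample data inline with datalines instead of from the file. The dataset name work.demo_positional is hypothetical, and the $regfmt. format is left out because its definition is redacted in the question.

```sas
/* Sketch: list input assigns values by position, not by header name. */
data work.demo_positional;          /* hypothetical dataset name */
    infile datalines dlm=',';
    /* Only four names are listed, so only the first four fields are read */
    input STATE $ NAME $ REGION $ POPESTIMATE2013;
    datalines;
40,3,6,1,Alabama,4833722,3722241,77
40,4,9,2,Alaska,735132,547000,74.4
;
run;

proc print data=work.demo_positional;
run;
```

For the Alabama row, STATE receives 40 (the SUMLEV field), NAME receives 3 (REGION), REGION receives 6 (DIVISION), and POPESTIMATE2013 receives 1 (STATE), which is the mismatch described in the question.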

The code below should produce the output you want. The keep statement lists the columns that you want in the output.

data foldr2.file;
    infile foldr1 dlm = "," firstobs = 2 obs = 52;
    /* Prevent truncating the name variable */
    informat NAME $20.;
    /* Name each of the columns; REGION is read as character so that $regfmt. can be applied */
    input SUMLEV REGION $ DIVISION STATE NAME $ POPESTIMATE2013 POPEST18PLUS2013 PCNT_POPEST18PLUS;
    /* Keep only the columns you want */
    keep STATE NAME REGION POPESTIMATE2013 PERCENT;
    PERCENT = POPESTIMATE2013/316128839;
    format REGION $regfmt.;
run;

For a slightly more involved solution, see Joe's excellent answer here. Applying this approach to your data requires setting the lengths of your columns in advance and converting the character values returned by scan to numeric where needed.

data foldr2.file;
    infile foldr1 dlm = "," firstobs = 2 obs = 52;
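    /* REGION stays character so that $regfmt. can be applied; NAME is wide enough for long names */
    length STATE 8 NAME $ 20 REGION $ 8 POPESTIMATE2013 8;
    /* Load the current line into _INFILE_ without reading any variables */
    input @;
    STATE = input(scan(_INFILE_, 4, ','), best.);
    NAME = scan(_INFILE_, 5, ',');
    REGION = scan(_INFILE_, 2, ',');
    POPESTIMATE2013 = input(scan(_INFILE_, 6, ','), best.);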
    PERCENT = POPESTIMATE2013/316128839;
    format REGION $regfmt.;
run;

If you want to become more familiar with SAS, it is worth taking a look at the SAS documentation on reading files.

Answer 2:

Your current data step is telling SAS what to name the first four variables in the txt file. To do what you want, you need to list all of the variables in the txt file in your "input" statement. Then, in your data statement, use the keep= option to select the variables you want to be included in the output dataset.

data foldr2.file (keep=STATE NAME REGION POPESTIMATE2013 PERCENT);
  infile foldr1 DLM=',' firstobs=2 obs=52;
  input
    SUMLEV
    REGION $
    DIVISION
    STATE $
    NAME $
    POPESTIMATE2013
    POPEST18PLUS2013
    PCNT_POPEST18PLUS;
  PERCENT=POPESTIMATE2013/316128839;
  format REGION $regfmt.;
run;
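
To try the keep= approach without the original file, a self-contained sketch along the following lines may help. It reads three rows of the sample data inline with datalines. The dataset name work.demo_keep is hypothetical, the NAME informat of $20. is borrowed from the first answer, and the $regfmt. format is omitted because its definition is redacted in the question.

```sas
/* Sketch: read every field in file order, then keep only the wanted variables. */
data work.demo_keep (keep=STATE NAME REGION POPESTIMATE2013 PERCENT);  /* hypothetical name */
    infile datalines dlm=',';
    /* Wider informat so longer names are not cut at the default 8 characters */
    informat NAME $20.;
    input SUMLEV REGION $ DIVISION STATE $ NAME $
          POPESTIMATE2013 POPEST18PLUS2013 PCNT_POPEST18PLUS;
    /* Share of the 2013 U.S. total taken from the sample data */
    PERCENT = POPESTIMATE2013 / 316128839;
    datalines;
10,0,0,0,United States,316128839,242542967,76.7
40,3,6,1,Alabama,4833722,3722241,77
40,4,9,2,Alaska,735132,547000,74.4
;
run;

proc print data=work.demo_keep;
run;
```

Because every field is named in file order, STATE, NAME, REGION, and POPESTIMATE2013 now hold the values from the matching columns, and the unwanted variables are dropped when the dataset is written.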

Source: https://stackoverflow.com/questions/28184526/how-to-create-a-truncated-permanent-database-from-a-larger-file-in-sas

Tags

database

sas