Manually reading in a specific section of a csv file into SAS

I want to read a csv file into SAS, but I only want to read in part of the file.
For example, I want my first row of data to start at row 18, and I only want to read columns 9, 11, 12, 13, 19, 20, and 36. Is there an efficient way to do this manually in a data step so that I read only the portions of the file I want, or is my best bet to read in the entire file using the import wizard and just keep the desired columns?

You can change the row you start at with the DATAROW= option on PROC IMPORT, or the FIRSTOBS= option on the INFILE statement in a data step.
You cannot easily read in only selected columns, however. You would have to read in all columns up to the last column you are interested in, and then drop the uninteresting ones. You could read the unwanted ones as a $1 character variable called "blank" or something (even reusing the same name each time), but you do have to ask for them.
The only workaround would be to write a regular expression to read in your data, in which case you could tell it to look for ,.*?,.*?, etc. for each skipped column.
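For illustration, here is that pattern in Python's re module (Python is used only to demonstrate the regex; in SAS you would build the equivalent pattern with the PRX functions). The column positions here are made up for the example:

import re

line = "a,b,c,d,e,f"
# capture columns 2 and 5 and skip the rest with lazy ,.*?, groups;
# this simple pattern assumes no quoted fields containing commas
pattern = re.compile(r"^.*?,(.*?),.*?,.*?,(.*?),")
print(pattern.match(line).groups())   # ('b', 'e')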

If you can use variable names instead of column numbers, this will work. I'd recommend using variable names instead of numbers anyway, as it adds substantive meaning to your code and might help you catch a problem if the input file's columns are ever changed.
PROC IMPORT datafile = "filename.csv"
    out = data_read (keep = var1 var2 var3)
    dbms = csv
    replace;
    datarow = 18;
RUN;

Related

Replacing multiple values in CSV

I have a directory full of CSVs. A script I use loads each CSV via a Loop and corrects commonly known errors in several columns prior to being imported into an SQL database. The corrections I want to apply are stored in a JSON file so that a user can freely add/remove any corrections on-the-fly without altering the main script.
My script works fine for one value correction, per column, per CSV. However, I have noticed that two or more columns per CSV now contain additional errors, and more than one correction per column is now required.
Here is relevant code:
with open('lookup.json') as f:
    translation_table = json.load(f)

for filename in gl.glob("(Compacted)_*.csv"):
    df = pd.read_csv(filename, dtype=object)
    # ... Some other enrichment ...
    # Extract the file "key" with a regular expression (regex)
    filekey = re.match(r"^\(Compacted\)_([A-Z0-9-]+_[0-9A-z]+)_[0-9]{8}_[0-9]{6}.csv$", filename).group(1)
    # Use the translation tables to apply any error fixes
    if filekey in translation_table["error_lookup"]:
        tablename = translation_table["error_lookup"][filekey]
        df[tablename[0]] = df[tablename[0]].replace({tablename[1]: tablename[2]})
    else:
        pass
And here is the lookup.json file:
{
    "error_lookup": {
        "T7000_08": ["MODCT", "C00", -5555],
        "T7000_17": ["MODCT", "C00", -5555],
        "T7000_20": ["CLLM5", "--", -5555],
        "T700_13": ["CODE", "100T", -5555]
    }
}
For example, if a CSV whose key is "T7000_20" has a new erroneous value of ";;" in column CLLM5, how can I ensure that values containing "--" and ";;" are both replaced with "-5555"? And how do I account for another column in the same CSV?
Can you change the JSON file? The example below would edit Column A (old1 → new1 and old2 → new2) and would make similar changes to Column B:
{
    "error_lookup": {
        "T7000_20": {
            "colA": ["old1", "new1", "old2", "new2"],
            "colB": ["old3", "new3", "old4", "new4"]
        }
    }
}
The JSON parsing gets a little more complex in order to handle the current use case and the new requirements.
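For example, the new structure could be applied roughly like this (a sketch that assumes df, filekey and translation_table are already set up as in the question's loop):

if filekey in translation_table["error_lookup"]:
    for col, pairs in translation_table["error_lookup"][filekey].items():
        # pairs is a flat list like ["old1", "new1", "old2", "new2"];
        # turn it into {"old1": "new1", "old2": "new2"} and replace in one call
        mapping = dict(zip(pairs[0::2], pairs[1::2]))
        df[col] = df[col].replace(mapping)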

Skip invalid data when importing .tsv using MySQL LOAD DATA INFILE

I am trying to import a bunch of .tsv files into a MySQL database. However, in some of the files there are errors in some of the rows (the files were generated from another system where data is manually inputted, so these errors are human errors). When I use LOAD DATA INFILE and the command reaches a row of bad data, it writes NULL values for that field and then stops, whereas I need it to keep going.
The bad rows look like this:
value1, value 2, value 3
bob, 3, st
john, 4, rd
dianne4ln
jack, 7, cir
I've made sure the line terminators are correct, and have used the IGNORE and REPLACE parameters to no avail.
Use IGNORE in your LOAD DATA INFILE statement to skip bad lines (they are reported as warnings) and keep going; see the MySQL documentation on LOAD DATA for details.
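If IGNORE alone does not do what you want, another option (not part of the answer above, just a sketch) is to pre-filter the .tsv in Python and only load rows with the expected number of fields; the file names and the field count of 3 are assumptions for the example:

import csv

EXPECTED_FIELDS = 3  # adjust to match your table

with open('input.tsv', newline='') as src, open('clean.tsv', 'w', newline='') as dst:
    reader = csv.reader(src, delimiter='\t')
    writer = csv.writer(dst, delimiter='\t')
    for row in reader:
        # keep only rows with the expected column count; bad rows like "dianne4ln" are dropped
        if len(row) == EXPECTED_FIELDS:
            writer.writerow(row)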

Python 3: write string list to csv file

I have found several answers (encoding, decoding...) online, but I still can't figure out what to do.
I have a list called abc.
abc = ['sentence1','-1','sentence2','1','sentence3','0'...]
Now I would like to store this list in a CSV file, the following way:
sentence1, -1
sentence2, 1
sentence3, 0
I know that the format of my abc list probably isn't how it should be to achieve this. I guess it should be a list of lists? But the major problem is actually that I have no clue how to write this to a CSV file using Python 3. The only time it kind of worked was when every character ended up separated by a comma.
Does anybody know how to solve this? Thank you!
You can use zip to pair each sentence with its number and csv.writer to write the rows:
import csv

abc = ['sentence1', '-1', 'sentence2', '1', 'sentence3', '0']
# pair items at even indices (sentences) with items at odd indices (numbers)
new = list(zip(abc[0::2], abc[1::2]))

with open('test.csv', 'w', newline='') as fp:
    a = csv.writer(fp, delimiter=',')
    a.writerows(new)
result:
sentence1,-1
sentence2,1
sentence3,0
See the Python documentation on reading and writing files. A CSV is basically the same thing as a txt file; the difference is that you use commas to separate the columns and new lines for the rows.
In your example you could do this (or iterate over a loop):
formated_to_csv = abc[0]+','+abc[1]+','+abc[2]+','+abc[3]...
The value of formated_to_csv would be 'sentence1,-1,sentence2,1,sentence3,0'. Note that this is a single string, so it will generate a single row; then write formated_to_csv as text to the csv file:
f.write(formated_to_csv)
To put all the sentences in the first column and all the numbers in the second column, it would be better to have a list of lists:
abc = [['sentence1','-1'],['sentence2','1'],['sentence3','0']...]
for row in abc:
    f.write(row[0]+','+row[1]+'\n')   # the '\n' is needed so each pair ends up on its own row
The "conversion" to a table will be done by Excel, Calc, or whatever program you use to read spreadsheets.

Replace missing value with cell above in either Perl or MySQL?

I'm importing a csv file of contacts, and where one parent has many children, the duplicated values are left blank. I need to make sure they are populated by the time they reach the database, however.
Is there a way that I can implement the following when I'm importing a .csv file into Perl and then exporting into MySQL?
if (value is null)
value = value above.
Thanks!
Why don't you keep the values from the previously read row in an array (e.g. @FIELD_DATA)? Then when you encounter an empty field while iterating over a row (e.g. for column 4) you can write
unless (length($CSV_FIELD[4])) {
    $CSV_FIELD[4] = $FIELD_DATA[4];   # reuse the value from the row above
}
(After processing each row, copy @CSV_FIELD into @FIELD_DATA so the next row sees the filled-in values.)
Not with an import statement, AFAIK. You could, however, make use of triggers (http://dev.mysql.com/doc/refman/5.0/en/triggers.html). Keep in mind, though, that this will seriously impact the performance of the import.
Also: if they are duplicate values you should have a critical look at your database model or your setup overall.
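If you can pre-process the CSV before it goes through Perl or MySQL, the same fill-down idea is short in pandas (a sketch, not taken from the answers above; the file names are hypothetical and blank cells are assumed to be read as NaN):

import pandas as pd

df = pd.read_csv('contacts.csv')               # hypothetical input file
df = df.ffill()                                # blank cells inherit the value from the row above
df.to_csv('contacts_filled.csv', index=False)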

How to process multivariate time series given as multiline, multirow *.csv files with Apache Pig?

I need to process multivariate time series given as multiline, multirow *.csv files with Apache Pig. I am trying to use a custom UDF (EvalFunc) to solve my problem. However, all loaders I tried (except org.apache.pig.impl.io.ReadToEndLoader, which I cannot get to work) to load the data in my csv files and pass it to the UDF return one line of the file as one record. What I need, however, is one column (or the content of the complete file) so that I can process a complete time series. Processing one value is obviously useless because I need longer sequences of values...
The data in the csv-files looks like this (30 columns, 1st is a datetime, all others are double values, here 3 sample lines):
17.06.2013 00:00:00;427;-13.793273;2.885583;-0.074701;209.790688;233.118828;1.411723;329.099170;331.554919;0.077026;0.485670;0.691253;2.847106;297.912382;50.000000;0.000000;0.012599;1.161726;0.023110;0.952259;0.024673;2.304819;0.027350;0.671688;0.025068;0.091313;0.026113;0.271128;0.032320;0
17.06.2013 00:00:01;430;-13.879651;3.137179;-0.067678;209.796500;233.141233;1.411920;329.176863;330.910693;0.071084;0.365037;0.564816;2.837506;293.418550;50.000000;0.000000;0.014108;1.159334;0.020250;0.954318;0.022934;2.294808;0.028274;0.668540;0.020850;0.093157;0.027120;0.265855;0.033370;0
17.06.2013 00:00:02;451;-15.080651;3.397742;-0.078467;209.781511;233.117081;1.410744;328.868437;330.494671;0.076037;0.358719;0.544694;2.841955;288.345883;50.000000;0.000000;0.017203;1.158976;0.022345;0.959076;0.018688;2.298611;0.027253;0.665095;0.025332;0.099996;0.023892;0.271983;0.024882;0
Does anyone have an idea how I could process this as 29 time series?
Thanks in advance!
What do you want to achieve?
If you want to read all rows in all files as a single record, this can work:
a = LOAD '...' USING PigStorage(';') as <schema> ;
b = GROUP a ALL;
b will contain all the rows in a bag.
If you want to read each CSV file as a single record, this can work:
a = LOAD '...' USING PigStorage(';', '-tagsource') as <schema> ;
b = GROUP a BY $0; --$0 is the filename
b will contain all the rows per file in a bag.