SAS Infile statement errors when trying to read in CSV file

My data looks like the following (this is a really small subset). This is a CSV file that's actually column separated, so it can be read quite easily in Excel.
ParentFlag CurrentBalanceF ValuationAmountO ValuationDateO ValuationDateC
PARENT 85481.49 145000 13/02/2004 30/04/2009
I'm trying to use the following code to import my data.
filename indata '&location.\AE&CO - inputs - 2014 09 30 Sept.csv';
Data treasury;
Infile indata firstobs=2 dlm=" "
/* delimiter=','*/
;
/* Length ext_acno $14.;*/
/* informat original_val_dt date9. current_val_dt date9. ;*/
input pcd_acno $ 1 ext_acno $ 2 loan_acno $ 3 acno $ 4 account_bal $ 5 trust_id $ 6 parentflag $ 7 account_bal_f 8
original_val_amt 9 original_val_dt 10 current_val_dt 11 original_val_type 12
current_val_type 13 indexed_ltv 14 original_ltv_wo_fees 15 latest_ltv 16 account_status_rbs $ 17 ;
;
run;
However, the log gives me errors and the data doesn't import properly. The resulting data set has fields where only one character is visible (for example, the parentflag field above shows only a 0).
I tried doing this using the import wizard, and it worked to a certain extent, but the log comes up with an "import unsuccessful" message, despite my table populating correctly...
Ideally I'd like to get the infile statement working, because it feels like the sturdier approach. For now it's just not behaving and I've no idea why! Could someone help?

You need to remove those numbers after the variable names (the ones following the dollar signs and the numeric variables).
input pcd_acno $ ext_acno $ loan_acno $ ... ;
The numbers serve no purpose and are just confusing SAS: they make it think you're trying to do some sort of column input, when what you want here is list input. In column input, a "column" means a single character position in the line, not what you're using it to mean; "CSV" means "comma-separated values", not "column".
For example:
ABCDE FGHIJ KLMNO
Column 5 in the above is "E", and column 9 is "H".

Related

Spark - Strange characters when reading CSV file

I hope someone can help me. My problem is the following:
To read a CSV file in Spark I'm using the code
val df=spark.read.option("header","true").option("inferSchema","true").csv("/home/user/Documents/filename.csv")
assuming that my file is called filename.csv and the path is /home/user/Documents/
To show the first 10 results I use
df.show(10)
but instead I get the following result, which contains the character � and does not show the 10 rows as desired:
scala> df.show(10)
+--------+---------+---------+-----------------+
| c1| c2| c3| c4|
+--------+---------+---------+-----------------+
|��1.0|5450|3007|20160101|
+--------+---------+---------+-----------------+
The CSV file looks something like this
c1 c2 c3 c4
1 5450 3007 20160101
2 2156 1414 20160107
1 78229 3656 20160309
1 34963 4484 20160104
1 7897 3350 20160105
11 13247 3242 20160303
2 4957 3350 20160124
1 73083 4211 20160207
The file that I'm trying to read is big. When I try smaller files I don't get the strange characters and I can see the first 10 results without a problem.
Any help is appreciated.
Sometimes the problem is not caused by Spark settings at all. Try re-saving (Save As) your CSV file as "CSV UTF-8 (comma delimited)", then rerun your code; the strange characters will be gone. I had a similar problem when reading a CSV file containing German words, did the above, and it was all good.
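If the file is too large to re-save comfortably in a spreadsheet program, the same re-encoding can be scripted. Here is a minimal sketch in Python, assuming the original file is in a Windows-1252 style encoding (change source_encoding to whatever your file actually uses); the destination name is just a placeholder:
source_encoding = "cp1252"   # assumption: the file's current encoding; change if needed

src_path = "/home/user/Documents/filename.csv"        # original file (from the question)
dst_path = "/home/user/Documents/filename_utf8.csv"   # re-encoded copy (placeholder name)

with open(src_path, "r", encoding=source_encoding) as src, \
        open(dst_path, "w", encoding="utf-8") as dst:
    for line in src:          # stream line by line, so large files are fine
        dst.write(line)
Pointing spark.read at the re-encoded copy should then show the rows without the � characters.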

How to import csv data where some observations are on two rows

I have a dataset with a couple million rows. It is in csv format. I wish to import it into Stata. I can do this, but there is a problem - a small percentage (but still many) of the observations appear on two lines in the CSV file. Most of the entries occur on only one line. The troublesome observations that take up 2 lines still follow the same pattern as far as being delimited by commas. But in the Stata dataset, the observation shows up on two rows, both rows containing only part of the total data.
I used import delimited to import the data. Is there anything that can be done at the data import stage of the process in Stata? I would prefer to not have to deal with this in the original CSV file if possible.
Update:
Here is an example of what the csv file looks like:
var1,var2,var3,var4,var5
text 1, text 2,text 3 ,text 4,text 5
text 6,text 7,text 8,text9,text10
text 11,text 1
2,text 13,text14,text15
text16,text17,text18,text19,text20
Notice that there is no comma at the end of the line. Also notice that the problem is with the observation that begins with text 11.
This is basically how it shows up in Stata:
var1 var2 var3 var4 var5
1 text 1 text 2 text 3 text 4 text 5
2 text 6 text 7 text 8 text9 text10
3 text 11 text 1
4 2 text 13 text14 text15
5 text16 text17 text18 text19 text20
The fact that the number is sometimes right next to the text isn't a mistake - it is just to illustrate that the data are more complex than shown here.
Of course, this is how I need the data:
var1 var2 var3 var4 var5
1 text 1 text 2 text 3 text 4 text 5
2 text 6 text 7 text 8 text9 text10
3 text 11 text 12 text 13 text14 text15
4 text16 text17 text18 text19 text20
A convoluted way is (comments inline):
clear
set more off
*----- example data -----
// change delimiter, if necessary
insheet using "~/Desktop/stata_tests/test.csv", names delim(;)
list
*----- what you want -----
// compute number of commas
gen numcom = length(var1var2var3var4var5) ///
- length(subinstr(var1var2var3var4var5, ",", "", .))
// save all data
tempfile orig
save "`orig'"
// keep observations that are fine
drop if numcom != 4
// save fine data
tempfile origfine
save "`origfine'"
*-----
// load all data
use "`orig'", clear
// keep offending observations
drop if numcom == 4
// for the -reshape-
gen i = int((_n-1)/2) +1
bysort i : gen j = _n
// check that pairs add up to 4 commas
by i : egen check = total(numcom)
assert check == 4
// no longer necessary
drop numcom check
// reshape wide
reshape wide var1var2var3var4var5, i(i) j(j)
// gen definitive variable
gen var1var2var3var4var5 = var1var2var3var4var51 + var1var2var3var4var52
keep var1var2var3var4var5
// append new observations with original good ones
append using "`origfine'"
// split
split var1var2var3var4var5, parse(,) gen(var)
// we're "done"
drop var1var2var3var4var5 numcom
list
But we don't really have the details of your data, so this may or may not work. It's just meant to be a rough draft. Depending on the memory occupied by your data, and other details, you may need to improve parts of the code to make it more efficient.
Note: the file test.csv looks like
var1,var2,var3,var4,var5
text 1, text 2,text 3 ,text 4,text 5
text 6,text 7,text 8,text9,text10
text 11,text 1
2,text 13,text14,text15
text16,text17,text18,text19,text20
Note 2: I'm using insheet because I don't have Stata 13 at the moment. import delimited is the way to go if available.
Note 3: details on how the counting of commas works can be reviewed at Stata tip 98: Counting substrings within strings, by Nick Cox.
I would try the following strategy:
1. Import as a single string variable.
2. Count the commas on each line and combine following lines whenever a line is incomplete.
3. Delete the redundant material.
The comma count will be
length(variable) - length(subinstr(variable, ",", "", .))
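If pre-processing the file outside Stata is acceptable, the same idea (count the commas, glue a continuation line onto the previous one) can be sketched in a few lines of Python rather than Stata. This is purely illustrative: it assumes 5 variables (so 4 commas per complete row), no quoted fields containing commas, and that a row never breaks after its last delimiter; the file names are placeholders.
expected_commas = 4                     # 5 variables -> 4 delimiters per complete row
rows = []

with open("test.csv") as f:             # placeholder input name
    for line in f:
        line = line.rstrip("\n")
        if rows and rows[-1].count(",") < expected_commas:
            rows[-1] += line            # continuation of a split observation: glue it on
        else:
            rows.append(line)           # a complete (or first-half) row

with open("test_fixed.csv", "w") as f:  # placeholder output name
    f.write("\n".join(rows) + "\n")
The repaired file can then be read with import delimited as usual.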
If the observations in question are quoted in the CSV file, then you can use the bindquote(strict) option.
A bit of speculation without seeing the exact data: following Roberto Ferrer's comment, you might find the Stata command filefilter useful in cleaning the csv file before importing. You can substitute new and old string patterns, using basic characters as well as more complex \n and \r terms.
I can't offer any code at the moment, but I suggest you take a good look at help import. The infile and infix commands state:
An observation can be on more than one line.
(I don't know if this means that all observations should be on several lines, or if it can handle cases where only some observations are on more than one line.)
Check also the manuals if the examples and notes in the help files turn out to be insufficient.

Reading text/number mixed CSV files as tables in Octave

Is there an easy way in Octave to load data from a CSV into a data structure similar to data frames in R? I tried csvread and dlmread, but Octave keeps reading the text as imaginary numbers, and I'd also like to be able to use the column headers as references. The examples I saw online seem way too convoluted; how is it possible that there is no function similar to R's data.frame? I saw a package called dataframe, but I can't seem to figure out how it works. Any tips or suggestions?
csvread('x') %returns 1 column imaginary numbers
dlmread('x') %returns N columns imaginary numbers
Any working alternative?
Why are you unable to make the dataframe package work? You need to be more specific. Here's a simple example:
$ cat cars.csv
Year,Make,Model
1997,Ford,E350
2000,Mercury,Cougar
$ octave
octave-cli-3.8.2:1> pkg load dataframe
octave-cli-3.8.2:2> cars = dataframe ("cars.csv")
cars = dataframe with 2 rows and 3 columns
Src: cars.csv
_1 Year Make Model
Nr double char char
1 1997 Ford E350
2 2000 Mercury Cougar

splitting CSV file by columns

I have a really huge CSV file. There are about 1700 columns and 40000 rows, like below:
x,y,z,x1,x2,x3,x4,x5,x6,x7,x8,x9,...(about 1700 more)...,x1700
0,0,0,a1,a2,a3,a4,a5,a6,a7,a8,a9,...(about 1700 more)...,a1700
1,1,1,b1,b2,b3,b4,b5,b6,b7,b8,b9,...(about 1700 more)...,b1700
// (about 40000 more rows below)
I need to split this CSV file into multiple files that contain fewer columns, like:
# file1.csv
x,y,z
0,0,0
1,1,1
... (about 40000 more rows below)
# file2.csv
x1,x2,x3,x4,x5,x6,x7,x8,x9,...(about 1000 more)...,x1000
a1,a2,a3,a4,a5,a6,a7,a8,a9,...(about 1000 more)...,a1000
b1,b2,b3,b4,b5,b6,b7,b8,b9,...(about 1000 more)...,b1000
// (about 40000 more rows below)
#file3.csv
x1001,x1002,x1003,x1004,x1005,...(about 700 more)...,x1700
a1001,a1002,a1003,a1004,a1005,...(about 700 more)...,a1700
b1001,b1002,b1003,b1004,b1005,...(about 700 more)...,b1700
// (about 40000 more rows below)
Is there any program or library that does this?
I've googled for it, but the programs I found only split a file by rows, not by columns.
Or which language could I use to do this efficiently?
I can use R, shell script, Python, C/C++, Java
A one-liner per output file, for your example data and desired output:
cut -d, -f -3 huge.csv > file1.csv
cut -d, -f 4-1003 huge.csv > file2.csv
cut -d, -f 1004- huge.csv > file3.csv
The cut program is available on most POSIX platforms and is part of GNU Core Utilities. There is also a Windows version.
Update: in Python, since the OP asked for a program in one of the acceptable languages:
# python 3
import csv
import fileinput

output_specifications = (  # (csv file name, column selector)
    ('file1.csv', slice(3)),
    ('file2.csv', slice(3, 1003)),
    ('file3.csv', slice(1003, 1703)),
)

output_row_writers = [
    (
        # newline='' is required for csv.writer on Python 3
        csv.writer(open(file_name, 'w', newline=''), quoting=csv.QUOTE_MINIMAL).writerow,
        selector,
    ) for file_name, selector in output_specifications
]

reader = csv.reader(fileinput.input())
for row in reader:
    for row_writer, selector in output_row_writers:
        row_writer(row[selector])
This works with the sample data given and can be called with the input.csv as an argument or by piping from stdin.
Use a small Python script like:
fin = 'file_in.csv'
fout1 = 'file_out1.csv'
fout1_fd = open(fout1, 'w')
...
lines = []
with open(fin) as fin_fd:
    lines = fin_fd.read().split('\n')
for l in lines:
    l_arr = l.split(',')
    fout1_fd.write(','.join(l_arr[0:3]))  # first three columns go to file 1
    fout1_fd.write('\n')
    ...
...
fout1_fd.close()
...
You can open the file in Microsoft Excel, delete the extra columns, and save as CSV for file #1. Repeat the same procedure for the other two files.
I usually use OpenOffice (or Microsoft Excel, if you are using Windows) to do this without writing any program: just edit the file and save it. Here are two useful links showing how to do that.
https://superuser.com/questions/407082/easiest-way-to-open-csv-with-commas-in-excel
http://office.microsoft.com/en-us/excel-help/import-or-export-text-txt-or-csv-files-HP010099725.aspx

How to merge two ipython notebooks correctly without getting json error?

I have tried:
cat file1.ipynb file2.ipynb > filecomplete.ipynb
since the notebooks are simply json files, but this gives me the error
Unreadable Notebook: Notebook does not appear to be JSON: '{\n "metadata": {'
I think these must be valid json files because file1 and file2 each load individually into nbviewer, and so I am not entirely sure what I am doing wrong.
This Python script concatenates all the notebooks named with a given prefix and present at the first level of a given folder. The resulting notebook is saved in the same folder under the name "compil_" + prefix + ".ipynb".
import json
import os

folder = "slides"
prefix = "quiz"

paths = [os.path.join(folder, name) for name in os.listdir(folder)
         if name.startswith(prefix) and name.endswith(".ipynb")]

result = json.loads(open(paths.pop(0), "r").read())
for path in paths:
    result["worksheets"][0]["cells"].extend(
        json.loads(open(path, "r").read())["worksheets"][0]["cells"])

open(os.path.join(folder, "compil_%s.ipynb" % prefix), "w").write(
    json.dumps(result, indent=1))
Warning: the metadata are those of the first notebook, and the cells those of the first worksheet only (which seems to contain all the cells, in my notebook at least).
Concatenating two objects that each have some property does not always yield an object with the same property. Here is an increasing sequence of numbers: 4 8 15 16 23 42; here is another one: 1 2 3 4 5 6 7. The concatenation of the two, 4 8 15 16 23 42 1 2 3 4 5 6 7, is not increasing. The same goes for JSON: concatenating two valid JSON documents does not give a valid JSON document.
You need to load the files with a JSON library and do the merge you want yourself. I suppose you "just" want to concatenate the cells, but maybe you want to concatenate worksheets, or maybe you want to merge the metadata.
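For the simple "concatenate the cells" case, here is a minimal sketch using the standard json module (file names taken from the question; adjust the key lookups if your notebooks use the older "worksheets" layout handled by the script above):
import json

# Load both notebooks.
with open("file1.ipynb") as f:
    nb1 = json.load(f)
with open("file2.ipynb") as f:
    nb2 = json.load(f)

# Append file2's cells to file1's. Newer notebooks keep cells at the top level;
# older ones (as in the error message above) keep them under worksheets.
if "cells" in nb1:
    nb1["cells"].extend(nb2["cells"])
else:
    nb1["worksheets"][0]["cells"].extend(nb2["worksheets"][0]["cells"])

# Write the combined notebook; the metadata is simply taken from the first file.
with open("filecomplete.ipynb", "w") as f:
    json.dump(nb1, f, indent=1)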