Stata read numeric data as string using variable names - csv

I am reading a csv file into Stata using
import delimited "../data_clean/winter20.csv", encoding(UTF-8)
The raw data looks like:
y id1
-.7709586 000000000020
-.4195721 000000003969
-.8932499 300000000021
-1.256116 200000007153
-.7858037 000000000000
The imported data becomes:
y id1
-.7709586 20
-.4195721 000000003969
-.8932499 300000000021
-1.256116 200000007153
-.7858037 0
However, some columns of IDs are read as numeric, and I would like to import them as strings. I want the data to be read exactly as it appears in the raw file.
The way I found online is:
import delimited "/Users/tianwang/Dropbox/Construction/data_clean/winter20.csv", encoding(UTF-8) stringcols(74 97 116) clear
However, the raw data may be updated and column numbers may change. The following
import delimited "/Users/tianwang/Dropbox/Construction/data_clean/winter20.csv", encoding(UTF-8) stringcols(id1 id2 id3) clear
gives the error "id1: invalid numlist in stringcols() option". Is there a way to specify variable names rather than column numbers?
The reason is that leading zeros are lost if I read the IDs as numeric. The tostring method does not recover the leading zeros, and format id1 %09.0f only works if the variables all have the same number of digits.

I think this should do it.
import delimited "../data_clean/winter20.csv", stringcols(_all) encoding(UTF-8) clear
PS: Tested in Stata16/Win10

Related

Can't display CSV file in PySpark (ValueError: Some of types cannot be determined by the first 100 rows, please try again with sampling)

I'm getting an error while displaying a CSV file through Pyspark. I've attached the PySpark code and CSV file that I used.
from pyspark.sql import *
spark.conf.set("fs.azure.account.key.xxocxxxxxxx","xxxxx")
time_on_site_tablepath = "wasbs://dwpocblob@dwadfpoc.blob.core.windows.net/time_on_site.csv"
time_on_site = spark.read.format("csv").options(header='true', inferSchema='true').load(time_on_site_tablepath)
display(time_on_site.head(50))
The error is shown below
ValueError: Some of types cannot be determined by the first 100 rows, please try again with sampling
CSV file format is attached below
time_on_site:pyspark.sql.dataframe.DataFrame
next_eventdate:timestamp
barcode:integer
eventdate:timestamp
sno:integer
eventaction:string
next_action:string
next_deviceid:integer
next_device:string
type_flag:string
site:string
location:string
flag_perimeter:integer
deviceid:integer
device:string
tran_text:string
flag:integer
timespent_sec:integer
gg:integer
CSV file data is attached below
next_eventdate,barcode,eventdate,sno,eventaction,next_action,next_deviceid,next_device,type_flag,site,location,flag_perimeter,deviceid,device,tran_text,flag,timespent_sec,gg
2018-03-16 05:23:34.000,1998296,2018-03-14 18:50:29.000,1,IN,OUT,2,AGATE-R02-AP-Vehicle_Exit,,NULL,NULL,1,1,AGATE-R01-AP-Vehicle_Entry,Access Granted,0,124385,0
2018-03-17 07:22:16.000,1998296,2018-03-16 18:41:09.000,3,IN,OUT,2,AGATE-R02-AP-Vehicle_Exit,,NULL,NULL,1,1,AGATE-R01-AP-Vehicle_Entry,Access Granted,0,45667,0
2018-03-19 07:23:55.000,1998296,2018-03-17 18:36:17.000,6,IN,OUT,2,AGATE-R02-AP-Vehicle_Exit,,NULL,NULL,1,1,AGATE-R01-AP-Vehicle_Entry,Access Granted,1,132458,1
2018-03-21 07:25:04.000,1998296,2018-03-19 18:23:26.000,8,IN,OUT,2,AGATE-R02-AP-Vehicle_Exit,,NULL,NULL,1,1,AGATE-R01-AP-Vehicle_Entry,Access Granted,0,133298,0
2018-03-24 07:33:38.000,1998296,2018-03-23 18:39:04.000,10,IN,OUT,2,AGATE-R02-AP-Vehicle_Exit,,NULL,NULL,1,1,AGATE-R01-AP-Vehicle_Entry,Access Granted,0,46474,0
What could be done to load the CSV file successfully?
There is no issue with your syntax; it works fine.
The issue is in the data of your CSV file: the column named type_flag has only None (null) values, so Spark cannot infer its data type.
So, here are two options.
You can display the data without using head(), like:
display(time_on_site)
If you want to use head(), then you need to replace the null values; here I replaced them with the empty string ('').
time_on_site = time_on_site.fillna('')
display(time_on_site.head(50))
For some reason, probably a bug, you get the same error on a display(df.head()) call even if you provide a schema on the spark.read.schema(my_schema).csv('path') call.
display(df) works though, which gave me a WTF moment.
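For reference, building such an explicit schema might look like the sketch below, based on the column list shown in the question (spark and time_on_site_tablepath are assumed to be defined as in the original snippet, and the types are taken from the schema printout above); whether this also avoids the head() error is, as the comment above notes, a separate issue.
from pyspark.sql.types import (StructType, StructField, StringType,
                               IntegerType, TimestampType)

# Explicit schema matching the columns listed above, so Spark does not
# have to infer any types (including the all-null type_flag column).
my_schema = StructType([
    StructField("next_eventdate", TimestampType(), True),
    StructField("barcode", IntegerType(), True),
    StructField("eventdate", TimestampType(), True),
    StructField("sno", IntegerType(), True),
    StructField("eventaction", StringType(), True),
    StructField("next_action", StringType(), True),
    StructField("next_deviceid", IntegerType(), True),
    StructField("next_device", StringType(), True),
    StructField("type_flag", StringType(), True),
    StructField("site", StringType(), True),
    StructField("location", StringType(), True),
    StructField("flag_perimeter", IntegerType(), True),
    StructField("deviceid", IntegerType(), True),
    StructField("device", StringType(), True),
    StructField("tran_text", StringType(), True),
    StructField("flag", IntegerType(), True),
    StructField("timespent_sec", IntegerType(), True),
    StructField("gg", IntegerType(), True),
])

time_on_site = spark.read.format("csv").options(header='true').schema(my_schema).load(time_on_site_tablepath)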

Convert JSON text entries to a dataframe in R

I have a text file with a JSON-like structure that contains values for certain variables, as below.
[{"variable1":"111","variable2":"666","variable3":"11","variable4":"aaa","variable5":"0"}]
[{"variable1":"34","variable2":"12","variable3":"78","variable4":"qqq","variable5":"-9"}]
Every line is a new set of values for the same variables 1 through 5. There can be thousands of lines in a text file, but the variables always remain the same. I want to extract variables 1 through 5 along with their values and convert them into a dataframe. Currently I perform these operations in Excel using string manipulation and transpose.
How can I do this in R? Much appreciated. Thanks,
J
There is a package named jsonlite that you can use.
library("jsonlite")
df <- fromJSON("YourPathToTheFile")
You can find more info here.

xlsread in Octave returns zero values

I am trying to read a CSV file in Octave. The file contains a table with both numeric and text data, including date and time information. In addition, the first line is in a different format than the rest of the lines, since it contains column titles.
csvread can only read numeric data (according to the Octave help), so I tried using xlsread as follows:
[NUMARR, TXTARR, RAWARR, LIMITS] = xlsread ('Line.csv')
I get only the NUMARR matrix with numeric values. However, all the other returned variables are empty; their dimension is 0x0.
How do I get all the text and all other information?
Thanks!
To solve this issue, open your CSV file in Windows Notepad and save it in ANSI encoding instead of Unicode.

Specify empty values by character string in PROC IMPORT

I'm coming to SAS from R in which this problem is fairly easy to solve.
I'm trying to load a bunch of CanSim CSV files (one example table here) with a %Macro function.
%Macro ReadCSV (infile , outfile );
PROC IMPORT
DATAFILE= &infile.
OUT= &outfile.
DBMS=CSV REPLACE;
GETNAMES=YES;
DATAROW=2;
RUN;
%Mend ReadCSV;
%ReadCSV("\\DATA\CanSimTables\02820135-eng.csv", work.cs02820135);
%ReadCSV("\\DATA\CanSimTables\02820158-eng.csv", work.cs02820158);
The problem is that the numeric Value column has ".." in all the CSVs whenever the value is missing. This causes an error when IMPORT gets to the rows with this character string.
Is there some way to tell IMPORT that any ".." should be removed or treated as missing values? (I found forums referring to the DSD option, but that doesn't seem to help me here.)
Thanks!
PROC IMPORT can only guess at the structure of your data. For example, it might see the .. and assume the column contains a character string instead of a number. It can also make other decisions that can render the generated dataset useless.
You will be better served by writing your own data step code to read the file. It is not very difficult to do. For your linked example file, all I did was copy and paste the first row of the CSV file, remove the commas, make the names valid variable names, and take some guesses as to how long to make the character variables.
data want ;
infile "&path/&fname" dsd truncover firstobs=2 ;
length Ref_Date $7 GEO $100 Geographical_classification $20
CHARACTERISTICS $100 STATISTICS DATATYPE $50 Vector Coordinate $20
Value 8
;
input (Ref_Date -- Value) (??) ;
run;
The ?? modifier tells SAS not to report any errors when trying to convert the text in the VALUE column into a number, so the .. and other garbage in the file will generate missing values.
Not explicitly relevant to this question, but if your issue were "N" or "D" or similar values that you wanted to become missing, there would be a somewhat easier solution: the missing statement (importantly distinct from the missing option).
missing M;
That tells SAS to treat a single character M in the data as a missing value and read it in accordingly. It is read in as the .M special missing value, which is functionally similar to the regular . missing value (but not actually equal to it in an equality comparison).

Prevent commas in CSV file from defining a new column when generating CSV

Example: $this->generating_csv = "12,000".",";
Instead of inserting 12,000 into one column, this inserts 12 into the first column and 000 into the second.
How can I solve this (PHP)?
That's the reading program's fault, if you are quoting the fields.
But it looks like you aren't quoting the fields.
RFC4180 allows for a line in a CSV file like:
"A","12,000","C"
to be 3 fields, not 4 fields. Notice that the fields are quoted.
A,12,000,C
is 4 fields, with A for field 1, 12 for field 2, 000 for field 3, and C for field 4. The reading program can't read this any other way... it can't read minds.
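As a quick check with a real RFC4180-style reader, here is a minimal Python illustration (using the standard csv module) of how those two example lines parse:
import csv
import io

# The quoted line parses into 3 fields, with the comma kept inside the second field.
print(next(csv.reader(io.StringIO('"A","12,000","C"'))))  # ['A', '12,000', 'C']

# The unquoted line splits into 4 fields.
print(next(csv.reader(io.StringIO('A,12,000,C'))))        # ['A', '12', '000', 'C']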
But many programs cannot deal with a CSV file that has commas in fields, even when the fields are correctly quoted so that they may include a comma. Adding to the confusion, many programs can be set up to split on commas rather than read according to RFC4180. Splitting on commas is not sufficient to read the full RFC4180 CSV format.
That being said, there is another problem here: 12,000 is not a valid number in Python, JavaScript, R, and many other languages, so even if the program is using a good RFC4180 reader (e.g. the csv module in Python), converting 12,000 into a number will still fail.
So possibly you want to remove commas between digits in quoted fields.
In your case, since you are generating the CSV file in PHP, simply try one of these two approaches:
Alternative A: Quote the fields
$this->generating_csv = '"12,000"'.','.'"field2"';
This will cause the " characters to appear in the output, so the comma stays inside one quoted field.
Alternative B: Generate the field without the embedded comma:
$this->generating_csv = "12000".","."field2";
If you already have several CSV files with commas in quoted fields and the reading program will not read them correctly, you can search and replace with a regexp to clean out the commas. In Python, the following will clean up the commas in a CSV file:
import csv
import sys
import re

# Read the input CSV and write a cleaned copy; csv.reader has already
# stripped the surrounding quotes from each field, so only the embedded
# commas need handling.
with open(sys.argv[1], 'r', newline='') as infile, \
     open(sys.argv[2], 'w', newline='') as outfile:
    csvin = csv.reader(infile)
    csvout = csv.writer(outfile, quoting=csv.QUOTE_ALL)
    for rowin in csvin:
        rowout = []
        for field in rowin:
            # change 10,000 to 10000
            field = re.sub(r'(\d+),(\d+)', r'\1\2', field)
            # change Cleveland, OH to Cleveland OH
            field = field.replace(',', ' ')
            rowout.append(field)
        csvout.writerow(rowout)
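If the script above is saved as, say, clean_commas.py (a name chosen here just for illustration), it can be run as python clean_commas.py input.csv output.csv; because of QUOTE_ALL, every field in the output file is written fully quoted.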