When I import TLS201_APPLN.csv from the PATSTAT database into SAS 9.4 (Unicode support), the log shows many similar notes like the one below. What should I do to fix it?
NOTE: Invalid data for appln_nr_original in line 5286 53-65.
RULE: ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+---
5286 6697,AT,2007000486,W ,2007-10-17,2007,WO2007AT00486,AT2007/000486,PI,0,Y,N,N,2006-12-22,
89 2006,1110640,2008-07-03,2008,6698,0,38109624,4532,10,2,2,1 146
appln_id=6697 appln_auth=AT appln_nr=2007000486 appln_kind=W appln_filing_date=2007-10-17
appln_filing_year=2007 appln_nr_epodoc=WO2007AT00486 appln_nr_original=. ipr_type=PI
internat_appln_id=0 int_phase=Y reg_phase=N nat_phase=N earliest_filing_date=2006-12-22
earliest_filing_year=2006 earliest_filing_id=1110640 earliest_publn_date=2008-07-03
earliest_publn_year=2008 earliest_pat_publn_id=6698 granted=0 docdb_family_id=38109624
inpadoc_family_id=4532 docdb_family_size=10 nb_citing_docdb_fam=2 nb_applicants=2 nb_inventors=1
_ERROR_=1 _N_=5285
Thanks in advance.
Fix the import so that it correctly reads in appln_nr_original.
Note the bolded sections below.
I counted out the number of variables and the problem is the 8th one, which should be AT2007/000486 in that record. However, SAS shows it as ., which means it thinks the variable is numeric when it is actually character. You need to modify your code to account for that (see the sketch after the quoted log below). I'd suggest exactly how, but you didn't include any code.
NOTE: Invalid data for appln_nr_original in line 5286 53-65.
RULE: ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+---
5286 6697,AT,2007000486,W ,2007-10-17,2007,WO2007AT00486,AT2007/000486,PI,0,Y,N,N,2006-12-22,
89 2006,1110640,2008-07-03,2008,6698,0,38109624,4532,10,2,2,1 146
appln_id=6697
appln_auth=AT
appln_nr=2007000486
appln_kind=W
appln_filing_date=2007-10-17
appln_filing_year=2007
appln_nr_epodoc=WO2007AT00486
appln_nr_original=.
ipr_type=PI
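The fix on the SAS side is to read appln_nr_original with a character informat of sufficient length instead of letting it default to numeric. As a quick cross-check outside SAS, here is a minimal pandas sketch (assuming the file sits in the working directory) that forces the column to stay character:

import pandas as pd

# force appln_nr_original to be read as text so values like AT2007/000486 survive
df = pd.read_csv("TLS201_APPLN.csv", dtype={"appln_nr_original": str})
print(df["appln_nr_original"].head())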
I am trying to upload a .csv file into Workbench using the Table Data Import Wizard.
I receive the following error whenever attempting to load it:
Unhandled exception: 'ascii' codec can't decode byte 0xc3 in position 1253: ordinal not in range(128)
I have tried previous solutions that suggested I encode the .csv file as a MS-DOS csv and as a UTF-8 csv. Neither have worked for me.
Changing the data in the file would not be feasible since it's made up of thousands of cells, so it would be quite impractical. Is there anything that can be done to resolve this?
What was after the C3? What should have been there?
C3, when interpreted as "latin1", is Ã -- an unlikely character.
More likely is a 2-byte UTF-8 code that starts with C3. This includes the accented letters of Western European languages. Example é, hex C3A9.
You tried "UTF-8 csv" -- Please provide the specifics of how you tried it. What settings in the Wizard, etc.
Probably you should state that the data is "UTF-8" or utf8mb4, depending on whether you are referring to outside or inside MySQL.
Meanwhile, if you are loading the data into an existing "table", let's see SHOW CREATE TABLE. It should probably not say "ascii" anywhere; instead, it should probably say "utf8mb4".
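To see for yourself what surrounds that byte, a small Python sketch (the file name is a placeholder) that dumps the bytes around the reported position 1253 and decodes them both ways:

with open("your_file.csv", "rb") as f:
    raw = f.read()

context = raw[1243:1263]  # ten bytes on either side of offset 1253
print(context)
print(context.decode("utf-8", errors="replace"))  # how it reads as UTF-8
print(context.decode("latin-1"))                  # how it reads as Latin-1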
I'm getting an error while displaying a CSV file through Pyspark. I've attached the PySpark code and CSV file that I used.
from pyspark.sql import *

# storage account key (masked) and the wasbs path to the CSV in the blob container
spark.conf.set("fs.azure.account.key.xxocxxxxxxx", "xxxxx")
time_on_site_tablepath = "wasbs://dwpocblob@dwadfpoc.blob.core.windows.net/time_on_site.csv"
time_on_site = spark.read.format("csv").options(header='true', inferSchema='true').load(time_on_site_tablepath)
display(time_on_site.head(50))
The error is shown below
ValueError: Some of types cannot be determined by the first 100 rows, please try again with sampling
The CSV file's schema is shown below
time_on_site:pyspark.sql.dataframe.DataFrame
next_eventdate:timestamp
barcode:integer
eventdate:timestamp
sno:integer
eventaction:string
next_action:string
next_deviceid:integer
next_device:string
type_flag:string
site:string
location:string
flag_perimeter:integer
deviceid:integer
device:string
tran_text:string
flag:integer
timespent_sec:integer
gg:integer
The CSV file data is shown below
next_eventdate,barcode,eventdate,sno,eventaction,next_action,next_deviceid,next_device,type_flag,site,location,flag_perimeter,deviceid,device,tran_text,flag,timespent_sec,gg
2018-03-16 05:23:34.000,1998296,2018-03-14 18:50:29.000,1,IN,OUT,2,AGATE-R02-AP-Vehicle_Exit,,NULL,NULL,1,1,AGATE-R01-AP-Vehicle_Entry,Access Granted,0,124385,0
2018-03-17 07:22:16.000,1998296,2018-03-16 18:41:09.000,3,IN,OUT,2,AGATE-R02-AP-Vehicle_Exit,,NULL,NULL,1,1,AGATE-R01-AP-Vehicle_Entry,Access Granted,0,45667,0
2018-03-19 07:23:55.000,1998296,2018-03-17 18:36:17.000,6,IN,OUT,2,AGATE-R02-AP-Vehicle_Exit,,NULL,NULL,1,1,AGATE-R01-AP-Vehicle_Entry,Access Granted,1,132458,1
2018-03-21 07:25:04.000,1998296,2018-03-19 18:23:26.000,8,IN,OUT,2,AGATE-R02-AP-Vehicle_Exit,,NULL,NULL,1,1,AGATE-R01-AP-Vehicle_Entry,Access Granted,0,133298,0
2018-03-24 07:33:38.000,1998296,2018-03-23 18:39:04.000,10,IN,OUT,2,AGATE-R02-AP-Vehicle_Exit,,NULL,NULL,1,1,AGATE-R01-AP-Vehicle_Entry,Access Granted,0,46474,0
What could be done to load the CSV file successfully?
There is no issue with your syntax; it works fine.
The issue is in the data of your CSV file: the column named type_flag contains only None (null) values, so Spark cannot infer its datatype.
So, here are two options.
You can display the data without using head(), like:
display(time_on_site)
If you want to use head(), then you need to replace the null values; here I replaced them with the empty string ('').
time_on_site = time_on_site.fillna('')
display(time_on_site.head(50))
For some reason, probably a bug, even if you provide a schema on the spark.read.schema(my_schema).csv('path') call, you still get the same error on a display(df.head()) call. display(df) works, though, which gave me a WTF moment.
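If you would rather not rely on inference at all, you can pass an explicit schema built from the column types listed above. A sketch along those lines (it reuses the question's time_on_site_tablepath and the Databricks display(); per the comment above, head() may still error even with a schema, so display the DataFrame directly):

from pyspark.sql.types import StructType, StructField, TimestampType, IntegerType, StringType

schema = StructType([
    StructField("next_eventdate", TimestampType(), True),
    StructField("barcode", IntegerType(), True),
    StructField("eventdate", TimestampType(), True),
    StructField("sno", IntegerType(), True),
    StructField("eventaction", StringType(), True),
    StructField("next_action", StringType(), True),
    StructField("next_deviceid", IntegerType(), True),
    StructField("next_device", StringType(), True),
    StructField("type_flag", StringType(), True),
    StructField("site", StringType(), True),
    StructField("location", StringType(), True),
    StructField("flag_perimeter", IntegerType(), True),
    StructField("deviceid", IntegerType(), True),
    StructField("device", StringType(), True),
    StructField("tran_text", StringType(), True),
    StructField("flag", IntegerType(), True),
    StructField("timespent_sec", IntegerType(), True),
    StructField("gg", IntegerType(), True),
])

# no inferSchema needed once the schema is supplied explicitly
time_on_site = spark.read.format("csv").options(header='true').schema(schema).load(time_on_site_tablepath)
display(time_on_site)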
My data set contains 1,300,000 observations with 56 columns. It is a .csv file and I'm trying to import it using PROC IMPORT. After importing, I find that only 44 out of 56 columns were imported.
I tried increasing the guessing rows, but it is not helping.
P.S.: I'm using SAS 9.3.
If (and only in that case, as far as I am aware) you specify the file to load in a FILENAME statement, you have to set the LRECL option to a value that is large enough.
If you don't, the default is only 256. So if your CSV has lines longer than 256 characters, SAS will not read the full line.
See this link for more information (just search for lrecl): https://support.sas.com/documentation/cdl/en/proc/61895/HTML/default/viewer.htm#a000308090.htm
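To confirm that record length is the culprit before touching the SAS code, a quick Python sketch (the file path is a placeholder) that reports the longest line in the CSV:

# print the length of the longest record; if it exceeds 256, the default LRECL truncates it
with open("your_file.csv", "rb") as f:
    longest = max(len(line.rstrip(b"\r\n")) for line in f)
print(longest)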
If you have SAS Enterprise Guide (I think it's now included with all desktop licenses) try out the import wizard. It's excellent. And it will generate code you can reuse with a little editing.
It will take a while to run because it will read your entire file before writing the import logic.
I am writing a data clean-up script (MS Smart Quotes, etc.) that will operate on MySQL tables encoded in Latin1. While scanning the data I noticed a ton of 0D 0A where the line breaks are.
Since I am cleaning the data, should I also address all of the 0D, too, by removing them? Is there ever a good reason to keep 0D (carriage return) anymore?
Thanks!
0D 0A (\r\n) and 0A (\n) are line terminators; \r\n is mostly used on Windows, \n on Unix systems.
Is there ever a good reason to keep 0D anymore?
I think you should answer this question yourself.
You could remove '\r' from the data, but make sure that the programs that will use this data understand that '\n' alone marks the end of a line. In most cases it is handled, but check just in case.
The CR/LF combination is a Windows thing. *NIX operating systems just use LF. So based on the application that uses your data, you'll need to make the decision on whether you want/need to filter out CR's. See the Wikipedia entry on newline for more info.
Python's readline() returns a line terminated with \012. The backslash introduces an octal escape, and octal 012 is decimal 10. You can see on the ASCII table that decimal 10 is NL or LF, newline or line feed.
Standard for end-of-line in a unix text or script file.
http://www.asciitable.com/
So be aware that len() will include the NL; unless you read past the EOF, len() will never be zero.
Therefore, if you INSERT any line of text obtained from Python's readline() into a MySQL table, it will include the NL character at the end by default.
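If you decide to strip the CR (and the LF) in Python before the INSERT, here is a minimal sketch, assuming the mysql-connector-python driver; the connection details, table, and column names are placeholders:

import mysql.connector  # assumes the mysql-connector-python package is installed

conn = mysql.connector.connect(user="user", password="pw", database="db")
cur = conn.cursor()

with open("data.txt", encoding="latin-1") as f:
    for line in f:
        value = line.rstrip("\r\n")  # drops the trailing \r\n (or a lone \n) before inserting
        cur.execute("INSERT INTO my_table (my_col) VALUES (%s)", (value,))

conn.commit()
conn.close()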
Is there a way to convert a dta file to a csv?
I do not have a version of Stata installed on my computer, so I cannot do something like:
File --> "Save as csv"
The frankly-incredible data-analysis library for Python called Pandas has a function to read Stata files.
After installing Pandas you can just do:
>>> import pandas as pd
>>> data = pd.io.stata.read_stata('my_stata_file.dta')
>>> data.to_csv('my_stata_file.csv')
Amazing!
You could try doing it through R:
For Stata <= 15 you can use the haven package to read the dataset and then you simply write it to external CSV file:
library(haven)
yourData = read_dta("path/to/file")
write.csv(yourData, file = "yourStataFile.csv")
Alternatively, visit the link pointed to by huntaub in a comment below.
For Stata <= 12 datasets foreign package can also be used
library(foreign)
yourData <- read.dta("yourStataFile.dta")
You can do it in StatTransfer, R or perl (as mentioned by others), but StatTransfer costs $$$ and R/Perl have a learning curve.
There is a free, menu-driven stats program from AM Statistical Software that can open and convert Stata .dta from all versions of Stata, see:
http://am.air.org/
I have not tried, but if you know Perl you can use the Parse-Stata-DtaReader module to convert the file for you.
The module has a command-line tool dta2csv, which can "convert Stata 8 and Stata 10 .dta files to csv"
Another way of converting between pretty much any data format using R is with the rio package.
Install R from CRAN and open R
Install the rio package using install.packages("rio")
Load the rio library, then use the convert() function:
library("rio")
convert("my_file.dta", "my_file.csv")
This method allows you to convert between many formats (e.g., Stata, SPSS, SAS, CSV, etc.). It uses the file extension to infer format and load using the appropriate importing package. More info can be found on the R-project rio page.
The R method will work reliably, and it requires little knowledge of R. Note that the conversion using the foreign package will preserve data, but may introduce differences. For example, when converting a table without a primary key, the primary key and associated columns will be inserted during the conversion.
From http://www.r-bloggers.com/using-r-for-stata-to-csv-conversion/ I recommend:
library(foreign)
write.table(read.dta(file.choose()), file=file.choose(), quote = FALSE, sep = ",")
In Python, one can use statsmodels.iolib.foreign.genfromdta to read Stata datasets. In addition, there is also a wrapper of the aforementioned function which can be used to read a Stata file directly from the web: statsmodels.datasets.webuse.
Nevertheless, both of the above rely on the use of the pandas.io.stata.StataReader.data, which is now a legacy function and has been deprecated. As such, the new pandas.read_stata function should now always be used instead.
According to the source file of stata.py, as of version 0.23.0, the following are supported:
Stata data file versions:
104
105
108
111
113
114
115
117
118
Valid encodings:
ascii
us-ascii
latin-1
latin_1
iso-8859-1
iso8859-1
8859
cp819
latin
latin1
L1
As others have noted, the pandas.to_csv function can then be used to save the file to disk. The related function numpy.savetxt can also save the data as a text file.
EDIT:
The following details come from help dtaversion in Stata 15.1:
Stata version .dta file format
----------------------------------------
1 102
2, 3 103
4 104
5 105
6 108
7 110 and 111
8, 9 112 and 113
10, 11 114
12 115
13 117
14 and 15 118 (# of variables <= 32,767)
15 119 (# of variables > 32,767, Stata/MP only)
----------------------------------------
file formats 103, 106, 107, 109, and 116
were never used in any official release.
StatTransfer is a program that moves data easily between Stata, Excel (or csv), SAS, etc. It is very user friendly (requires no programming skills). See www.stattransfer.com
If you use the program just note that you will have to choose "ASCII/Text - Delimited" to work with .csv files rather than .xls
Some mentioned SPSS and StatTransfer; they are not free. R and Python (also mentioned above) may be your choice. Personally, I would recommend Python: the syntax is much more intuitive than R's. With Pandas you can read and export most of the commonly used data formats in just a few lines:
import pandas as pd
df = pd.read_stata('YourDataName.dta')
df.to_csv('YourDataName.csv')
SPSS can also read .dta files and export them to .csv, but that costs money. PSPP, an open-source version of SPSS (which is rough), might also be able to read/export .dta files.
PYTHON - CONVERT STATA FILES IN DIRECTORY TO CSV
import glob
import os
import pandas

path = r"{Path to Folder}"  # folder that contains the .dta files

for file in glob.glob(os.path.join(path, "*.dta")):  # collects all the Stata files
    # get the file path/name without the ".dta" extension
    file_name, file_extension = os.path.splitext(file)
    # read your data
    df = pandas.read_stata(file, convert_categoricals=False, convert_missing=True)
    # save the data and never think about stata again :)
    df.to_csv(file_name + '.csv')
For those who have Stata (even though the asker does not) you can use this:
outsheet produces a tab-delimited file, so you need to specify the comma option as below:
outsheet [varlist] using file.csv , comma
Also, if you want to remove labels (which are included by default), add the nolabel option:
outsheet [varlist] using file.csv, comma nolabel
hat tip to:
http://www.ats.ucla.edu/stat/stata/faq/outsheet.htm