I hope someone can help me. My problem is the following:
To read a CSV file in Spark I'm using the code
val df=spark.read.option("header","true").option("inferSchema","true").csv("/home/user/Documents/filename.csv")
assuming that my file is called filename.csv and the path is /home/user/Documents/
To show the first 10 results I use
df.show(10)
but instead I get the following result, which contains the character � and does not show the 10 rows as desired:
scala> df.show(10)
+--------+---------+---------+-----------------+
| c1| c2| c3| c4|
+--------+---------+---------+-----------------+
|��1.0|5450|3007|20160101|
+--------+---------+---------+-----------------+
The CSV file looks something like this
c1 c2 c3 c4
1 5450 3007 20160101
2 2156 1414 20160107
1 78229 3656 20160309
1 34963 4484 20160104
1 7897 3350 20160105
11 13247 3242 20160303
2 4957 3350 20160124
1 73083 4211 20160207
The file that I'm trying to read is big. When I try smaller files, I don't get the strange character and I can see the first 10 results without any problem.
Any help is appreciated.
Sometimes the problem is not caused by Spark settings. Try re-saving (Save As) your CSV file as "CSV UTF-8 (comma delimited)", then rerun your code; the strange characters will be gone. I had a similar problem when reading a CSV file containing German words; after doing the above, everything was fine.
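If re-saving the file is not convenient, Spark's CSV reader also accepts an encoding option, so you can tell it which character set the file actually uses. Here is a minimal sketch (shown in PySpark; the same options work from the Scala shell). The "UTF-16" value is only a guess based on the leading � characters and may need to be changed:

df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .option("encoding", "UTF-16")  # guessed encoding; adjust to whatever the file really uses
      .csv("/home/user/Documents/filename.csv"))
df.show(10)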
I'm getting an error while displaying a CSV file through PySpark. I've attached the PySpark code and the CSV file that I used.
from pyspark.sql import *
spark.conf.set("fs.azure.account.key.xxocxxxxxxx","xxxxx")
time_on_site_tablepath = "wasbs://dwpocblob@dwadfpoc.blob.core.windows.net/time_on_site.csv"
time_on_site = spark.read.format("csv").options(header='true', inferSchema='true').load(time_on_site_tablepath)
display(time_on_site.head(50))
The error is shown below
ValueError: Some of types cannot be determined by the first 100 rows, please try again with sampling
CSV file format is attached below
time_on_site:pyspark.sql.dataframe.DataFrame
next_eventdate:timestamp
barcode:integer
eventdate:timestamp
sno:integer
eventaction:string
next_action:string
next_deviceid:integer
next_device:string
type_flag:string
site:string
location:string
flag_perimeter:integer
deviceid:integer
device:string
tran_text:string
flag:integer
timespent_sec:integer
gg:integer
CSV file data is attached below
next_eventdate,barcode,eventdate,sno,eventaction,next_action,next_deviceid,next_device,type_flag,site,location,flag_perimeter,deviceid,device,tran_text,flag,timespent_sec,gg
2018-03-16 05:23:34.000,1998296,2018-03-14 18:50:29.000,1,IN,OUT,2,AGATE-R02-AP-Vehicle_Exit,,NULL,NULL,1,1,AGATE-R01-AP-Vehicle_Entry,Access Granted,0,124385,0
2018-03-17 07:22:16.000,1998296,2018-03-16 18:41:09.000,3,IN,OUT,2,AGATE-R02-AP-Vehicle_Exit,,NULL,NULL,1,1,AGATE-R01-AP-Vehicle_Entry,Access Granted,0,45667,0
2018-03-19 07:23:55.000,1998296,2018-03-17 18:36:17.000,6,IN,OUT,2,AGATE-R02-AP-Vehicle_Exit,,NULL,NULL,1,1,AGATE-R01-AP-Vehicle_Entry,Access Granted,1,132458,1
2018-03-21 07:25:04.000,1998296,2018-03-19 18:23:26.000,8,IN,OUT,2,AGATE-R02-AP-Vehicle_Exit,,NULL,NULL,1,1,AGATE-R01-AP-Vehicle_Entry,Access Granted,0,133298,0
2018-03-24 07:33:38.000,1998296,2018-03-23 18:39:04.000,10,IN,OUT,2,AGATE-R02-AP-Vehicle_Exit,,NULL,NULL,1,1,AGATE-R01-AP-Vehicle_Entry,Access Granted,0,46474,0
What could be done to load the CSV file successfully?
There is no issue in your syntax; it works fine.
The issue is in the data of your CSV file: the column named type_flag contains only None (null) values, so its datatype cannot be inferred.
So, here are two options.
You can display the data without using head(), like this:
display(time_on_site)
If you want to use head(), then you need to replace the null values; here I replaced them with the empty string ('').
time_on_site = time_on_site.fillna('')
display(time_on_site.head(50))
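If other columns contain nulls you want to keep, fillna also accepts a subset argument, so only the all-null column needs replacing. A small variation on the above (type_flag is the column named in the question):

# Replace nulls only in the all-null type_flag column, leaving the other columns untouched.
time_on_site = time_on_site.fillna('', subset=['type_flag'])
display(time_on_site.head(50))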
For some reason, probably a bug, even if you provide a schema on the spark.read.schema(my_schema).csv('path') call, you still get the same error from a display(df.head()) call.
display(df) works, though, but it gave me a WTF moment.
I have some large csv and xlsx files which I need to set up pandas DataFrames for. I have code which locates these files within the directory (when printed, these show correct pathnames). These paths are then passed to a helper function which is meant to set up the required DataFrames for the files, then the data will be passed to other functions for some manipulation. I intend to have the data written to a file (by loading a template, writing the data to it, and saving this file) once this is completed.
I currently have code like:
import pandas
# some set-up functions (which work; verified using print statements)
def createDataFrame(filename):
    if filename.endswith('.csv'):
        df = pandas.read_csv(StringIO(filename), skip_blank_lines=True, index_col=False,
                             encoding="utf-8", skipinitialspace=True)
When I try print(df), I get:
Empty DataFrame
Columns: [a.csv]
Index: []
and print(StringIO(filename)) gives me:
<_io.StringIO object at 0x004D1990>
However, when I leave out the StringIO() around filename in the function, I get this error:
OSError: File b'a.csv' does not exist
Everywhere that I've been able to find information on this has either just said to import pandas and start using it, or talks about using read_csv() rather than from_csv() (from this question, which wasn't very helpful here), and even the current pandas docs basically say that it should be as easy as passing the file to pandas.read_csv().
1) I've checked that I have full permissions and that the file is valid and exists. Why am I getting the OSError?
2) When I use StringIO(), why am I still getting an empty DataFrame here? How can I fix this?
Thanks in advance.
I have solved this.
StringIO was the root cause of this problem. Because I'm on Windows, os.path.is_file() was returning False, and I got the error:
OSError: File b'a.csv' does not exist
It wasn't until I stumbled upon this page from the Python 2.5 docs that I discovered that the call should actually be os.path.isfile() on Windows because it uses ntpath behind the scenes. This is to better handle the difference in pathnames between systems (Windows uses '\', Unix uses '/').
Because I had something weird going on in my paths, pandas was unable to properly load the CSV files into DataFrames.
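As a side note, a minimal illustration of why the StringIO version produced an empty DataFrame: StringIO(filename) hands pandas the literal string as the file contents, so pandas parses "a.csv" as a one-cell header row with no data rows.

from io import StringIO
import pandas

# The "file" pandas sees contains just the text "a.csv", which becomes the header.
print(pandas.read_csv(StringIO("a.csv")))
# Empty DataFrame
# Columns: [a.csv]
# Index: []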
By simply changing my code from this:
import pandas
# some set-up functions (which work; verified using print statements)
def createDataFrame(filename):
    if filename.endswith('.csv'):
        df = pandas.read_csv(StringIO(filename), skip_blank_lines=True, index_col=False,
                             encoding="utf-8", skipinitialspace=True)
to this:
import pandas
import os
# some set-up functions (which have been updated)
def createDataFrame(filename):
    basepath = config.complete_datasets_dir  # config is my own settings module
    fullpath = os.path.join(basepath, filename)
    if filename.endswith('.csv'):
        df = pandas.read_csv(fullpath, skip_blank_lines=True, index_col=False,
                             encoding="utf-8", skipinitialspace=True)
        return df
and updating the function that calls it accordingly:
def somefunc():
    dfs = []
    data_lists = getInputFiles()  # checks data directory for files containing info
    for item in data_lists:
        tdata = createDataFrame(item)
        dfs.append(tdata)
    print(dfs)
I was able to get the output I was looking for:
[ 1 2 3 4 5 6 7 8 9 10
0 11 12 13 14 15 16 17 18 19 20
1 21 22 23 24 25 26 27 28 29 30
2 31 32 33 34 35 36 37 38 39 40, 1 2 3 4 5 6 7 8 9 10
0 11 12 13 14 15 16 17 18 19 20
1 21 22 23 24 25 26 27 28 29 30]
which is a list of two DataFrames, the first of which came from a CSV containing only the numbers 1-40 (on 4 rows total, no headers); the second file contains only the numbers 1-30 (formatted the same way).
I hope this helps someone in the future.
I'm trying to load a large csv file, 3,715,259 lines.
I created this file myself and there are 9 fields separated by commas.
Here's the error:
df = pd.read_csv("avaya_inventory_rev2.csv", error_bad_lines=False)
Skipping line 2924525: expected 9 fields, saw 11
Skipping line 2924526: expected 9 fields, saw 10
Skipping line 2924527: expected 9 fields, saw 10
Skipping line 2924528: expected 9 fields, saw 10
This doesn't make sense to me. I inspected the offending lines using:
sed -n "2924524,2924525p" infile.csv
I can't list the outputs as they contain proprietary information for a client. I'll try to synthesize a meaningful replacement.
Lines 2924524 and 2924525 look to have the same number of fields to me.
Also, I was able to load the same file into a MySQL table with no error.
create table Inventory (path varchar (255), isText int, ext varchar(5), type varchar(100), size int, sloc int, comments int, blank int, tot_lines int);
I don't know enough about MySQL to understand why that may or may not be a valid test, or why pandas would have a different outcome when loading the same file.
TIA !
UPDATE: I tried to read with engine='python':
Error: new-line character seen in unquoted field - do you need to open the file in universal-newline mode?
When I create this CSV, I'm using a shell script I wrote; I feed lines to the file with the >> redirect.
I tried the suggested fix:
input = open(input, 'rU')
df = pd.read_csv(input, engine='python')
Back to the same error:
ValueError: Expected 9 fields in line 5157, saw 11
I'm guessing it has to do with my csv creation script and how I dealt with
quoting in that. I don't know how to investigate this further.
I opened the CSV input file in vim, and on line 5157 there is a ^M, which Google says is a Windows carriage return.
OK... I'm closer, although I did kind of suspect something like this and used dos2unix on the CSV input.
I removed the ^M using vim and re-ran, and got the same error about 11 fields. However, I can now see the 11 fields, whereas before I just saw 9. There are ,v's, which are likely some kind of Windows holdover?
SUMMARY: Somebody thought it'd be cute to name files like fobar.sh,v, so my profiler didn't mess up; it was just a naming weirdness, plus the random CR/LF from Windows that snuck in.
Cheers
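In case anyone hits something similar: below is a minimal sketch of writing such an inventory CSV with Python's csv module, so that values containing commas (like the RCS-style fobar.sh,v file names) are quoted instead of spawning extra fields. The file name and the example row are made up; the column names are the ones from the MySQL table above.

import csv
import pandas as pd

columns = ["path", "isText", "ext", "type", "size", "sloc", "comments", "blank", "tot_lines"]
rows = [["src/fobar.sh,v", 1, ".sh", "shell script", 1024, 80, 5, 3, 88]]  # hypothetical row

# csv.writer quotes any field containing the delimiter, so the comma inside the
# file name stays in one field instead of shifting everything over.
with open("inventory.csv", "w", newline="") as fh:
    writer = csv.writer(fh)
    writer.writerow(columns)
    writer.writerows(rows)

df = pd.read_csv("inventory.csv")
print(df.shape)  # (1, 9) -- still 9 fields per row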
My data looks like the following (this is a really small subset). This is a CSV file that's actually column separated, so it can be read quite easily in Excel.
ParentFlag CurrentBalanceF ValuationAmountO ValuationDateO ValuationDateC
PARENT 85481.49 145000 13/02/2004 30/04/2009
I'm trying to use the following code to import my data.
filename indata '&location.\AE&CO - inputs - 2014 09 30 Sept.csv';
Data treasury;
Infile indata firstobs=2 dlm=" "
/* delimiter=','*/
;
/* Length ext_acno $14.;*/
/* informat original_val_dt date9. current_val_dt date9. ;*/
input pcd_acno $ 1 ext_acno $ 2 loan_acno $ 3 acno $ 4 account_bal $ 5 trust_id $ 6 parentflag $ 7 account_bal_f 8
original_val_amt 9 original_val_dt 10 current_val_dt 11 original_val_type 12
current_val_type 13 indexed_ltv 14 original_ltv_wo_fees 15 latest_ltv 16 account_status_rbs $ 17 ;
;
run;
However, the log gives me errors and the data doesn't import properly. My data set has fields that only have one character visible (for example, the parentflag field above only has a 0).
I tried doing this using the import wizard, and it worked to a certain extent, but the log comes up with an "import unsuccessful" message, despite my table populating correctly...
Ideally I'd like to get the infile statement working because it feels like a sturdier substitute. For now, it's just not behaving and I've no idea why! Could someone help?
You need to remove those numbers after the dollar signs.
input pcd_acno $ ext_acno $ loan_acno $ ... ;
The numbers just confuse SAS; they don't serve any purpose. They make SAS think you're trying to do some sort of column input, but you want list input here. Here, "column" means a one-character-wide position in the line, not what you're using it to mean; "CSV" means "comma-separated values", not "column".
For example:
ABCDE FGHIJ KLMNO
Column 5 in the above is "E", and column 9 is "H".
I have tried:
cat file1.ipynb file2.ipynb > filecomplete.ipynb
since the notebooks are simply json files, but this gives me the error
Unreadable Notebook: Notebook does not appear to be JSON: '{\n "metadata": {'
I think these must be valid json files because file1 and file2 each load individually into nbviewer, and so I am not entirely sure what I am doing wrong.
This Python script concatenates all the notebooks named with a given prefix and present at the first level of a given folder. The resulting notebook is saved in the same folder under the name "compil_" + prefix + ".ipynb".
import json
import os
folder = "slides"
prefix = "quiz"
# Collect all notebooks in the folder whose names start with the prefix.
paths = [os.path.join(folder, name) for name in os.listdir(folder) if name.startswith(prefix) and name.endswith(".ipynb")]

# Use the first notebook as the base, then append the cells of every other
# notebook to its first worksheet.
result = json.loads(open(paths.pop(0), "r").read())
for path in paths:
    result["worksheets"][0]["cells"].extend(json.loads(open(path, "r").read())["worksheets"][0]["cells"])

open(os.path.join(folder, "compil_%s.ipynb" % prefix), "w").write(json.dumps(result, indent=1))
Warning: the metadata are those of the first notebook, and the cells those of the first worksheet only (which seems to contain all the cells, in my notebook at least).
Concatenating two objects that each have some property does not always yield an object with the same property. Here is an increasing sequence of numbers: 4 8 15 16 23 42; here is another one: 1 2 3 4 5 6 7. Their concatenation, 4 8 15 16 23 42 1 2 3 4 5 6 7, is not increasing. The same goes for JSON: concatenating two valid notebook files does not produce a valid notebook.
You need to load each JSON file using the json library and do the merge you want yourself. I suppose you "just" want to concatenate the cells, but maybe you want to concatenate worksheets, or maybe you want to merge metadata.
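For the common case of simply concatenating the cells of the two notebooks from the question, here is a minimal sketch. It assumes the older notebook format with a "worksheets" list (as in the script above); newer v4 notebooks keep their cells in a top-level "cells" list instead, so both cases are handled:

import json

with open("file1.ipynb") as f1, open("file2.ipynb") as f2:
    first = json.load(f1)
    second = json.load(f2)

# Append the second notebook's cells to the first; the merged notebook
# keeps the first notebook's metadata.
if "worksheets" in first:   # old nbformat (v3)
    first["worksheets"][0]["cells"].extend(second["worksheets"][0]["cells"])
else:                       # nbformat v4
    first["cells"].extend(second["cells"])

with open("filecomplete.ipynb", "w") as out:
    json.dump(first, out, indent=1)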