Spark can't get delimiter for CSV file

I have a CSV file. Read with pandas it parses correctly (first posted image), but when I read it with PySpark the columns come out wrong (second posted image).
What's wrong with the delimiter in Spark, and how can I fix it?

From the posted images, %2C, the URL-encoded equivalent of a comma (,), seems to be your delimiter.
Set the delimiter to %2C and also use the header option:
df = spark.read.option("header", True).option("delimiter", "%2C").csv(path)
Input CSV File:
date%2Copening%2Chigh%2Clow%2Cclose%2Cadjclose%2Cvolume
2022-12-09%2C100%2C101%2C99%2C99.5%2C99.5%2C10000000
2022-12-09%2C200%2C202%2C199%2C199%2C199.1%2C20000000
2022-12-09%2C300%2C303%2C299%2C299%2C299.2%2C30000000
Output dataframe:
+----------+-------+----+---+-----+--------+--------+
|date |opening|high|low|close|adjclose|volume |
+----------+-------+----+---+-----+--------+--------+
|2022-12-09|100 |101 |99 |99.5 |99.5 |10000000|
|2022-12-09|200 |202 |199|199 |199.1 |20000000|
|2022-12-09|300 |303 |299|299 |299.2 |30000000|
+----------+-------+----+---+-----+--------+--------+
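If you are unsure what the delimiter actually is, one quick check is to read the file as plain text and inspect the first line. A minimal sketch, assuming a hypothetical path (note that multi-character delimiters such as %2C require Spark 3.0 or later):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
path = "/data/stocks.csv"  # hypothetical path

# Peek at the raw first line to see what actually separates the fields
print(spark.read.text(path).first()[0])
# e.g. date%2Copening%2Chigh%2Clow%2Cclose%2Cadjclose%2Cvolume

# Read with the delimiter found above
df = spark.read.option("header", True).option("delimiter", "%2C").csv(path)
df.show(truncate=False)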

Related

make a list of lists out of the header from a csv file

I want to put the header of the csv file in a nested list.
It should have an output like this:
[[name], [age], [""], [""]]
How can I do this without reading the line again? (I am not allowed to, and I also am not allowed to use the csv module or pandas; all imports except os are forbidden.)
Just map each item of the list to a list. Check below:
def value_to_list(tlist):
    # wrap each header string in its own single-element list, in place
    for i in range(len(tlist)):
        tlist[i] = [tlist[i]]
    return tlist

headers = []
with open(r"D:\my_projects\DemoProject\test.csv", "r") as file:
    # strip the trailing newline so the last header doesn't keep a "\n"
    headers = value_to_list(file.readline().strip().split(","))
The test.csv file contains "col1,col2,col3".
Output:
> python -u "run.py"
[['col1'], ['col2'], ['col3']]
>
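For what it's worth, the same wrapping can be done with a list comprehension, which avoids mutating the list in place (same file path as above):

with open(r"D:\my_projects\DemoProject\test.csv", "r") as file:
    headers = [[h] for h in file.readline().strip().split(",")]
print(headers)  # [['col1'], ['col2'], ['col3']]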

Is there a way to programmatically set a dataset's schema from a .csv

As an example, I have a .csv in the Excel dialect, which escapes quotes by doubling them (like the doublequote setting in Python's csv module).
For example, consider the row below:
"XX ""YYYYYYYY"", ZZZZZZ ""QQQQQQ""","JJJJ ""MMMM"", RRRR ""TTTT""",1234,RRRR,60,50
I would want the schema to then become:
[
'XX "YYYYYYYY", ZZZZZZ "QQQQQQ"',
'JJJJ "MMMM", RRRR "TTTT"',
1234,
'RRRR',
60,
50
]
Is there a way to set the schema of a dataset in a programmatic/automated fashion?
While you can do this in code, Foundry's dataset app can also do this natively. This means you can skip writing the code (which is nice), and it can also save a step in your pipeline (which might save you some runtime).
After uploading the files to a dataset, press "edit schema" on the dataset.
Then apply parsing settings like those shown in the posted screenshots, which would result in the desired outcome in your case.
Then press "save and validate" and the dataset should end up with the correct schema.
Starting with this example:
Dataset<Row> dataset = files
    .sparkSession()
    .read()
    .option("inferSchema", "true")
    .csv(csvDataset);
output.getDataFrameWriter(dataset).write();
Add the header, quote, and escape options, like so:
Dataset<Row> dataset = files
    .sparkSession()
    .read()
    .option("inferSchema", "true")
    .option("header", "true")
    .option("quote", "\"")
    .option("escape", "\"")
    .csv(csvDataset);
output.getDataFrameWriter(dataset).write();
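For comparison, the equivalent quote and escape options in PySpark would look roughly like this (a sketch, with a hypothetical file path):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = (spark.read
      .option("inferSchema", "true")
      .option("header", "true")
      .option("quote", '"')   # fields are wrapped in double quotes
      .option("escape", '"')  # a doubled quote escapes a literal quote
      .csv("example.csv"))    # hypothetical path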

Load pipe delimited CSV data having " (double quote) in one of the column in hive

I have data as below:
Rollno|Name|height|department
101|Aman|5"2|C.S.E
I am taking all the columns as strings.
When I load the above data into Hive, I get an extra quote at the start and end, as below:
Rollno:-"101
Name:-Aman
Height:-5"2
Department:-C.S.E"
Can anyone help me with a solution?
Specify your separator, like so:
val df = spark.read.option("header","true").option("inferSchema","true").option("sep", "|").csv("test.csv")
df.show(false)
+------+----+------+----------+
|Rollno|Name|height|department|
+------+----+------+----------+
|101 |Aman|5"2 |C.S.E |
+------+----+------+----------+
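If stray double quotes still end up inside values, quote handling can be turned off entirely by setting the quote option to an empty string, which Spark's CSV reader supports. A PySpark sketch, reusing the test.csv name from above:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = (spark.read
      .option("header", "true")
      .option("sep", "|")
      .option("quote", "")  # empty string turns off quote processing
      .csv("test.csv"))
df.show(truncate=False)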

How to remove unwanted delimiter in json using python

jsonValue="{'Employee': ['{"userId":"rirani","jobTitleName":"Developer","firstName":"Romin","lastName":"Irani","preferredFullName":"Romin Irani","employeeCode":"E1","region":"CA","phoneNumber":"408-1234567","emailAddress":"romin.k.irani@gmail.com"}', '{"userId":"nirani","jobTitleName":"Developer","firstName":"Neil","lastName":"Irani","preferredFullName":"Neil Irani","employeeCode":"E2","region":"CA","phoneNumber":"408-1111111","emailAddress":"neilrirani@gmail.com"}', '{"userId":"thanks","jobTitleName":"Program Directory","firstName":"Tom","lastName":"Hanks","preferredFullName":"Tom Hanks","employeeCode":"E3","region":"CA","phoneNumber":"408-2222222","emailAddress":"tomhanks@gmail.com"}']}"
with open("F://IDP Umesh//Data Transformation//test.json", 'w') as jsonFile:
    jsonFile.write(json.dumps(jsonValue))
Output from test.json:
{"Employee": ["{\"userId\":\"rirani\",\"jobTitleName\":\"Developer\",\"firstName\":\"Romin\",\"lastName\":\"Irani\",\"preferredFullName\":\"Romin Irani\",\"employeeCode\":\"E1\",\"region\":\"CA\",\"phoneNumber\":\"408-1234567\",\"emailAddress\":\"romin.k.irani@gmail.com\"}", "{\"userId\":\"nirani\",\"jobTitleName\":\"Developer\",\"firstName\":\"Neil\",\"lastName\":\"Irani\",\"preferredFullName\":\"Neil Irani\",\"employeeCode\":\"E2\",\"region\":\"CA\",\"phoneNumber\":\"408-1111111\",\"emailAddress\":\"neilrirani@gmail.com\"}", "{\"userId\":\"thanks\",\"jobTitleName\":\"Program Directory\",\"firstName\":\"Tom\",\"lastName\":\"Hanks\",\"preferredFullName\":\"Tom Hanks\",\"employeeCode\":\"E3\",\"region\":\"CA\",\"phoneNumber\":\"408-2222222\",\"emailAddress\":\"tomhanks@gmail.com\"}"]}
How can I remove the '\' characters from the JSON content and make it valid JSON?
I'd appreciate it if anyone can help with this.
Thanks
Try this.
import json

jsonValue = {'Employee': ['{"userId":"rirani","jobTitleName":"Developer","firstName":"Romin","lastName":"Irani","preferredFullName":"Romin Irani","employeeCode":"E1","region":"CA","phoneNumber":"408-1234567","emailAddress":"romin.k.irani@gmail.com"}', '{"userId":"nirani","jobTitleName":"Developer","firstName":"Neil","lastName":"Irani","preferredFullName":"Neil Irani","employeeCode":"E2","region":"CA","phoneNumber":"408-1111111","emailAddress":"neilrirani@gmail.com"}', '{"userId":"thanks","jobTitleName":"Program Directory","firstName":"Tom","lastName":"Hanks","preferredFullName":"Tom Hanks","employeeCode":"E3","region":"CA","phoneNumber":"408-2222222","emailAddress":"tomhanks@gmail.com"}']}

# each element of the 'Employee' list is itself a JSON string: parse each into a dict
jsonValue['Employee'] = [json.loads(i) for i in jsonValue['Employee']]
print(jsonValue)

with open("test.json", 'w') as jsonFile:
    jsonFile.write(json.dumps(jsonValue))
The problem with your code is that you're dumping a string that is merely formatted like JSON; dumps works when you need to convert a dict to a JSON-formatted string, not when you already have a string.
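To confirm the written file is valid JSON, it can be read back with json.load, which raises an error on invalid input (a quick check against the same test.json):

import json

with open("test.json") as jsonFile:
    data = json.load(jsonFile)  # fails loudly if the file is not valid JSON
print(data['Employee'][0]['userId'])  # rirani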

read_csv file in pandas reads whole csv file in one column

I want to read a csv file in pandas. I have used this function:
ace = pd.read_csv('C:\\Users\\C313586\\Desktop\\Daniil\\Daniil\\ACE.csv',sep = '\t')
And as output I got this:
a) First row (should be the header):
_AdjustedNetWorthToTotalCapitalEmployed _Ebit _StTradeRec _StTradePay _OrdinaryCf _CfWorkingC _InvestingAc _OwnerAc _FinancingAc _ProdValueGrowth _NetFinancialDebtTotalAdjustedCapitalEmployed_BanksAndOtherInterestBearingLiabilitiesTotalEquityAndLiabilities _NFDEbitda _DepreciationAndAmortizationProductionValue _NumberOfDays _NumberOfDays360
b) Other rows, separated by tabs:
0 5390\t0000000000000125\t0\t2013-12-31\t2013\tF...
1 5390\t0000000000000306\t0\t2015-12-31\t2015\tF...
2 5390\t00000000000003VG\t0\t2015-12-31\t2015\tF...
3 5390\t0000000000000405\t0\t2016-12-31\t2016\tF...
4 5390\t00000000000007VG\t0\t2013-12-31\t2013\tF...
5 5390\t0000000000000917\t0\t2015-12-31\t2015\tF...
6 5390\t00000000000009VG\t0\t2016-12-31\t2016\tF...
7 5390\t0000000000001052\t0\t2015-12-31\t2015\tF...
8 5390\t00000000000010SG\t0\t2015-12-31\t2015\tF...
Do you have any ideas why this happens? How can I fix it?
You should use the argument sep=r'\t' (note the extra r, which makes it a raw string). This will make pandas search for the exact string \t.
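A minimal sketch of that fix with the same file (a raw-string separator longer than one character is interpreted by pandas as a regular expression and forces the Python parsing engine):

import pandas as pd

# sep=r'\t' is the two-character sequence backslash-t, treated as a regex
ace = pd.read_csv('C:\\Users\\C313586\\Desktop\\Daniil\\Daniil\\ACE.csv', sep=r'\t')
print(ace.head())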