Wrong encoding when reading csv file with pyspark

For a university course, I run the pyspark-notebook Docker image:
docker pull jupyter/pyspark-notebook
docker run -it --rm -p 8888:8888 -v /path/to/my/working/directory:/home/jovyan/work jupyter/pyspark-notebook
and then run the following Python code:
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import *
sc = pyspark.SparkContext('local[*]')
spark = SparkSession(sc)
spark
listings_df = spark.read.csv("listings.csv", header=True, mode='DROPMALFORMED')
# adding encoding="utf8" to the line above doesn't help either
listings_df.printSchema()
The problem appears when reading the file. Spark seems to read it incorrectly (possibly an encoding problem?): after reading, listings_df has 16494 rows, while the correct number of rows is 16478 (checked with pandas.read_csv()). You can also see that something is broken by running
listings_df.groupBy("room_type").count().show()
which gives the following output:
+---------------+-----+
|      room_type|count|
+---------------+-----+
|            169|    1|
|        4.88612|    1|
|        4.90075|    1|
|    Shared room|   44|
|             35|    1|
|            187|    1|
|           null|   16|
|             70|    1|
|             27|    1|
|             75|    1|
|     Hotel room|  109|
|            198|    1|
|             60|    1|
|            280|    1|
|Entire home/apt|12818|
|            220|    1|
|            190|    1|
|            156|    1|
|            450|    1|
|        4.88865|    1|
+---------------+-----+
only showing top 20 rows
while the only real room_type values are ['Private room', 'Entire home/apt', 'Hotel room', 'Shared room'].
Spark info which might be useful:
SparkSession - in-memory
SparkContext
Spark UI
Version: v3.1.2
Master: local[*]
AppName: pyspark-shell
And the encoding of the file:
!file listings.csv
listings.csv: UTF-8 Unicode text
listings.csv is an Airbnb statistics CSV file downloaded from here.
I've also uploaded all the code to Colab.

I found two things:
Some fields contain quotes that need to be escaped (escape='"').
Also, @JosefZ mentioned unwanted line breaks inside fields (multiLine=True).
This is how to read the file:
input_df = spark.read.csv(path, header=True, multiLine=True, escape='"')
output_df = input_df.groupBy("room_type").count()
output_df.show()
+---------------+-----+
|      room_type|count|
+---------------+-----+
|    Shared room|   44|
|     Hotel room|  110|
|Entire home/apt|12829|
|   Private room| 3495|
+---------------+-----+
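To see why these two options change the row count, here is a minimal sketch that uses only Python's standard csv module on a made-up sample. The sample rows are hypothetical and not taken from listings.csv; the point is only that a quoted field containing doubled quotes and a line break looks like two physical lines but is one logical record.

```python
import csv
import io

# Hypothetical sample: the "name" field of the first record contains
# a doubled quote ("") and an embedded line break inside the quotes.
SAMPLE = (
    'id,name,room_type\n'
    '1,"Cozy ""loft""\nnear park",Private room\n'
    '2,Plain flat,Entire home/apt\n'
)


def naive_row_count(text):
    """Count data rows by splitting on line breaks, ignoring quoting."""
    return len(text.splitlines()) - 1  # subtract the header line


def parsed_rows(text):
    """Parse with quote-aware rules: doubled quotes and embedded newlines."""
    return list(csv.DictReader(io.StringIO(text, newline='')))


if __name__ == "__main__":
    print(naive_row_count(SAMPLE))   # 3 physical lines after the header
    print(len(parsed_rows(SAMPLE)))  # 2 logical records
```

A reader that splits on line breaks first, as Spark does without multiLine=True, sees the extra physical line as a separate (malformed) row, which is consistent with the stray values in the room_type column above.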

I think specifying the file's encoding should solve the problem. Add encoding="utf8" to the arguments of the spark.read.csv call that creates listings_df, as shown below:
listings_df = spark.read.csv("listings.csv", encoding="utf8", header=True, mode='DROPMALFORMED')

Related

How can I load a JSON file in Brython?

I am working on a page with Brython, and one thing I would like to do is load a JSON file on the same server, in the same directory.
How can I do that? open()? Something else?
Thanks,

You are on the right track. Here is how I did it, in my .bry file:
import json
from browser import console  # Brython module that exposes the browser console

css_file = "static/json/css.json"
j = json.load(open(css_file))
# For debugging
console.log(j)
Since the script is executed from your HTML page, the path to the JSON file is resolved relative to your .html file, not to the .py or .bry file.
Also, check if your JSON file has the right structure ;)
Here is my tree structure.
|   index.html
|   README.txt
|   unicode.txt
|
+---static
|   +---assets
|   |
|   +---css
|   |
|   +---fonts
|   |
|   +---js
|   |
|   +---json
|   |   |   css.json
|   |
|   \---py
|   |   |   main.bpy
Hope this helps :D

pyspark - read csv with custom row delimiter

How can I read a CSV file with a custom row delimiter (\x03) using PySpark?
I tried the following code, but it did not work.
df = spark.read.option("lineSep","\x03").csv(path)
display(df)

Works just fine with both OSS Spark (3.2.0) and DBR 9.1 ML:
>>> df = spark.read.option("lineSep","\x03")\
.option("header", "true").csv("/path_to_file.csv")
>>> df.show()
+----+----+
|val1|val2|
+----+----+
|   1|   2|
|   3|   4|
+----+----+
Look for problems inside the file itself, or something similar.
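One way to look for problems inside the file is to split its raw bytes on the separator outside Spark and inspect the records. The sketch below is plain Python and only an illustration: the helper name split_records and the sample bytes are made up, and in practice you would read the bytes from your own file.

```python
def split_records(raw: bytes, sep: bytes = b"\x03"):
    """Split raw file bytes into text records on a custom separator."""
    records = raw.split(sep)
    # A separator at the very end of the file leaves one empty trailing piece.
    if records and records[-1] == b"":
        records.pop()
    return [r.decode("utf-8") for r in records]


if __name__ == "__main__":
    # Hypothetical content: header plus two rows, each terminated by \x03.
    sample = b"val1,val2\x031,2\x033,4\x03"
    for record in split_records(sample):
        print(repr(record))
```

If the records printed this way contain stray \n or \r characters, or if the split yields a single record, the file probably does not use \x03 the way the lineSep option expects.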

How to apply group by on pyspark dataframe and a transformation on the resulting object

I have a Spark data frame:
| item_id | attribute_key | attribute_value |
_____________________________________________
| id_1    | brand         | Samsung         |
| id_1    | ram           | 6GB             |
| id_2    | brand         | Apple           |
| id_2    | ram           | 4GB             |
_____________________________________________
I want to group this data frame by item_id and write it to a file in which each line is a JSON object:
{id_1: "properties":[{"brand":['Samsung']},{"ram":['6GB']} ]}
{id_2: "properties":[{"brand":['Apple']},{"ram":['4GB']} ]}
This is a big distributed data frame, so converting it to pandas is not an option.
Is this kind of transformation even possible in PySpark?

In Scala, but the Python version will be very similar (using pyspark.sql.functions):
val df = Seq((1,"brand","Samsung"),(1,"ram","6GB"),(1,"ram","8GB"),(2,"brand","Apple"),(2,"ram","6GB")).toDF("item_id","attribute_key","attribute_value")
+-------+-------------+---------------+
|item_id|attribute_key|attribute_value|
+-------+-------------+---------------+
|      1|        brand|        Samsung|
|      1|          ram|            6GB|
|      1|          ram|            8GB|
|      2|        brand|          Apple|
|      2|          ram|            6GB|
+-------+-------------+---------------+
df.groupBy('item_id, 'attribute_key)
  .agg(collect_list('attribute_value).as("list2"))
  .groupBy('item_id)
  .agg(map(lit("properties"), collect_list(map('attribute_key, 'list2))).as("prop"))
  .select(to_json(map('item_id, 'prop)).as("json"))
  .show(false)
output:
+------------------------------------------------------------------+
|json                                                              |
+------------------------------------------------------------------+
|{"1":{"properties":[{"ram":["6GB","8GB"]},{"brand":["Samsung"]}]}}|
|{"2":{"properties":[{"brand":["Apple"]},{"ram":["6GB"]}]}}        |
+------------------------------------------------------------------+
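To check the expected JSON shape on a small sample, the same two-level grouping can be sketched in plain Python. This is not a Spark implementation and would not scale to a distributed data frame; the function name to_json_lines is made up for illustration, and the input rows are the ones from the Scala example.

```python
import json
from collections import defaultdict


def to_json_lines(rows):
    """Group (item_id, key, value) tuples into one JSON string per item_id."""
    # First level: item_id -> attribute_key -> list of attribute values.
    grouped = defaultdict(lambda: defaultdict(list))
    for item_id, key, value in rows:
        grouped[item_id][key].append(value)

    lines = []
    for item_id, attrs in grouped.items():
        # Second level: one {key: [values]} object per attribute key.
        properties = [{key: values} for key, values in attrs.items()]
        lines.append(json.dumps({str(item_id): {"properties": properties}}))
    return lines


if __name__ == "__main__":
    rows = [
        (1, "brand", "Samsung"),
        (1, "ram", "6GB"),
        (1, "ram", "8GB"),
        (2, "brand", "Apple"),
        (2, "ram", "6GB"),
    ]
    for line in to_json_lines(rows):
        print(line)
```

Here the attribute order inside "properties" follows input order, whereas the Spark output above gives no ordering guarantee after the shuffle.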

adding a unique consecutive row number to dataframe in pyspark

I want to add a unique row number to my dataframe in PySpark, and I don't want to use the monotonicallyIncreasingId or partitionBy methods.
This question might be a duplicate of similar questions asked earlier, but I'm still looking for advice on whether I am doing it the right way.
The following is a snippet of my code.
I have a CSV file with the input records below:
1,VIKRANT SINGH RANA ,NOIDA ,10000
3,GOVIND NIMBHAL ,DWARKA ,92000
2,RAGHVENDRA KUMAR GUPTA,GURGAON ,50000
4,ABHIJAN SINHA ,SAKET ,65000
5,SUPER DEVELOPER ,USA ,50000
6,RAJAT TYAGI ,UP ,65000
7,AJAY SHARMA ,NOIDA ,70000
8,SIDDHARTH BASU ,SAKET ,72000
9,ROBERT ,GURGAON ,70000
and I have loaded this csv file into a dataframe.
PATH_TO_FILE="file:///u/user/vikrant/testdata/EMP_FILE.csv"
emp_df = spark.read.format("com.databricks.spark.csv") \
.option("mode", "DROPMALFORMED") \
.option("header", "true") \
.option("inferschema", "true") \
.option("delimiter", ",").load(PATH_TO_FILE)
+------+--------------------+--------+----------+
|emp_id| emp_name|emp_city|emp_salary|
+------+--------------------+--------+----------+
| 1|VIKRANT SINGH RAN...|NOIDA | 10000|
| 3|GOVIND NIMBHAL ...|DWARKA | 92000|
| 2|RAGHVENDRA KUMAR ...|GURGAON | 50000|
| 4|ABHIJAN SINHA ...|SAKET | 65000|
| 5|SUPER DEVELOPER ...|USA | 50000|
| 6|RAJAT TYAGI ...|UP | 65000|
| 7|AJAY SHARMA ...|NOIDA | 70000|
| 8|SIDDHARTH BASU ...|SAKET | 72000|
| 9|ROBERT ...|GURGAON | 70000|
+------+--------------------+--------+----------+
empRDD = emp_df.rdd.zipWithIndex()
newRDD=empRDD.map(lambda x: (list(x[0]) + [x[1]]))
newRDD.take(2);
[[1, u'VIKRANT SINGH RANA ', u'NOIDA ', 10000, 0], [3, u'GOVIND NIMBHAL ', u'DWARKA ', 92000, 1]]
When I appended the int value to the list, I lost the dataframe schema.
newdf=newRDD.toDF(['emp_id','emp_name','emp_city','emp_salary','row_id'])
newdf.show();
+------+--------------------+--------+----------+------+
|emp_id| emp_name|emp_city|emp_salary|row_id|
+------+--------------------+--------+----------+------+
| 1|VIKRANT SINGH RAN...|NOIDA | 10000| 0|
| 3|GOVIND NIMBHAL ...|DWARKA | 92000| 1|
| 2|RAGHVENDRA KUMAR ...|GURGAON | 50000| 2|
| 4|ABHIJAN SINHA ...|SAKET | 65000| 3|
| 5|SUPER DEVELOPER ...|USA | 50000| 4|
| 6|RAJAT TYAGI ...|UP | 65000| 5|
| 7|AJAY SHARMA ...|NOIDA | 70000| 6|
| 8|SIDDHARTH BASU ...|SAKET | 72000| 7|
| 9|ROBERT ...|GURGAON | 70000| 8|
+------+--------------------+--------+----------+------+
Am I doing it the right way, or is there a better way to add the column while preserving the dataframe's schema in PySpark?
Is it feasible to use the zipWithIndex method to add a unique consecutive row number to a large dataframe as well? Can we use this row_id to repartition the dataframe so that the data is distributed uniformly across the partitions?

I have found a solution, and it's very simple.
Since no column in my dataframe has the same value across all rows, row_number() with a partitionBy clause on an existing column does not generate unique row numbers across the whole dataframe.
Let's add a new column with a constant value to the existing dataframe:
from pyspark.sql import Window
from pyspark.sql.functions import lit, row_number
emp_df = emp_df.withColumn("new_column", lit("ABC"))
Then create a window specification with partitionBy on that column, "new_column":
w = Window.partitionBy('new_column').orderBy(lit('A'))
df = emp_df.withColumn("row_num", row_number().over(w)).drop("new_column")
You will get the desired result:
+------+--------------------+--------+----------+-------+
|emp_id| emp_name|emp_city|emp_salary|row_num|
+------+--------------------+--------+----------+-------+
| 1|VIKRANT SINGH RAN...|NOIDA | 10000| 1|
| 2|RAGHVENDRA KUMAR ...|GURGAON | 50000| 2|
| 7|AJAY SHARMA ...|NOIDA | 70000| 3|
| 9|ROBERT ...|GURGAON | 70000| 4|
| 4|ABHIJAN SINHA ...|SAKET | 65000| 5|
| 8|SIDDHARTH BASU ...|SAKET | 72000| 6|
| 5|SUPER DEVELOPER ...|USA | 50000| 7|
| 3|GOVIND NIMBHAL ...|DWARKA | 92000| 8|
| 6|RAJAT TYAGI ...|UP | 65000| 9|
+------+--------------------+--------+----------+-------+
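On the question of whether zipWithIndex is feasible for large data, it helps to see how the indices are assigned. As documented for RDD.zipWithIndex, elements are ordered first by partition index and then by position within the partition, so Spark needs the size of every earlier partition before it can number a later one. The sketch below imitates that logic in plain Python with lists standing in for partitions; the function name zip_with_index and the sample data are illustrative only.

```python
from itertools import accumulate


def zip_with_index(partitions):
    """Pair each element with a global consecutive index, partition by partition."""
    sizes = [len(part) for part in partitions]
    # Offset of a partition = number of elements in all earlier partitions.
    offsets = [0] + list(accumulate(sizes))[:-1]
    return [
        [(item, offset + i) for i, item in enumerate(part)]
        for part, offset in zip(partitions, offsets)
    ]


if __name__ == "__main__":
    # Three hypothetical partitions of unequal size.
    parts = [["a", "b"], ["c"], ["d", "e"]]
    for part in zip_with_index(parts):
        print(part)
```

Once the per-partition counts are known, each partition can be numbered independently, which is why this approach keeps the data distributed, while a window over a single constant partition key has to bring every row into one partition.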

Using Spark SQL:
df = spark.sql("""
SELECT
row_number() OVER (
PARTITION BY ''
ORDER BY ''
) as id,
*
FROM
VALUES
('Bob ', 20),
('Alice', 21),
('Gary ', 21),
('Kent ', 25),
('Gary ', 35)
""")
Output:
>>> df.printSchema()
root
|-- id: integer (nullable = true)
|-- col1: string (nullable = false)
|-- col2: integer (nullable = false)
>>> df.show()
+---+-----+----+
| id| col1|col2|
+---+-----+----+
| 1|Bob | 20|
| 2|Alice| 21|
| 3|Gary | 21|
| 4|Kent | 25|
| 5|Gary | 35|
+---+-----+----+

MYSQL 3 tables has amount column sum

How can I sum an amount column from each of 3 different tables in MySQL?
The tables are listed below.
budget
| b_id | amount |
|    1 |    100 |
|    2 |    200 |
cash_advance
| ca_id | b_id | ca_amount |
|     1 |    1 |       100 |
|     2 |    2 |       200 |
expenses
| exp_id | ca_id | exp_amount |
|      1 |     1 |        100 |
|      2 |     2 |         40 |
|      3 |     2 |        160 |
I want this result:
result
| sum(b_amount) | sum(ca_amount) | sum(exp_amount) |
|           100 |            100 |             100 |
|           200 |            200 |             200 |
Is there a MySQL query for this? Thanks.
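One possible approach is to aggregate each child table in a subquery and join the totals to budget. The sketch below runs the query against an in-memory SQLite database through Python's sqlite3 module so that it is self-contained; the query uses only standard SQL constructs, so I would expect it to carry over to MySQL, but that is an assumption and untested here. The column types and the function name totals_per_budget are also assumptions.

```python
import sqlite3

QUERY = """
SELECT b.b_id,
       b.amount     AS b_amount,
       ca.total_ca  AS ca_amount,
       ex.total_exp AS exp_amount
FROM budget AS b
LEFT JOIN (
    -- total cash advance per budget
    SELECT b_id, SUM(ca_amount) AS total_ca
    FROM cash_advance
    GROUP BY b_id
) AS ca ON ca.b_id = b.b_id
LEFT JOIN (
    -- total expenses per budget, reached through cash_advance
    SELECT c.b_id, SUM(e.exp_amount) AS total_exp
    FROM expenses AS e
    JOIN cash_advance AS c ON c.ca_id = e.ca_id
    GROUP BY c.b_id
) AS ex ON ex.b_id = b.b_id
ORDER BY b.b_id
"""


def totals_per_budget():
    """Load the sample rows from the question and return one total row per budget."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript("""
            CREATE TABLE budget (b_id INTEGER PRIMARY KEY, amount INTEGER);
            CREATE TABLE cash_advance (ca_id INTEGER PRIMARY KEY, b_id INTEGER, ca_amount INTEGER);
            CREATE TABLE expenses (exp_id INTEGER PRIMARY KEY, ca_id INTEGER, exp_amount INTEGER);
            INSERT INTO budget VALUES (1, 100), (2, 200);
            INSERT INTO cash_advance VALUES (1, 1, 100), (2, 2, 200);
            INSERT INTO expenses VALUES (1, 1, 100), (2, 2, 40), (3, 2, 160);
        """)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for row in totals_per_budget():
        print(row)
```

Aggregating in subqueries before joining avoids double counting when one budget has several cash advances or one cash advance has several expenses.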

You are trying to access the network on your UI thread. This is bad because it will freeze the UI until the network response has returned. You should do this network access on a separate thread.
There are many options, but the simplest would be:
Declare msg as final: final MimeMessage msg = new MimeMessage(session);
Wrap Transport.send(msg); in a new thread: new Thread(new Runnable() { @Override public void run() { try { Transport.send(msg); } catch (MessagingException e) { e.printStackTrace(); } } }).start();

The log is indicating that you're doing network tasks on your main thread, the UI / Activity thread. Use an AsyncTask for those tasks instead.
Android forbids those tasks on your main thread, because they block the UI and make it unusable until your task is finished.

In Android 3.0 and higher, network connections on the main thread aren't permitted; StrictMode is turned on automatically.
To fix this issue, you must perform the network connection on a separate thread, for example using an AsyncTask, a Thread, or a Handler.
