Duplicating the rows of a Spark DataFrame to get a bigger DataFrame

I have a DataFrame loaded from disk:
df_ = sqlContext.read.json("/Users/spark_stats/test.json")
It contains 500k rows.
My script works fine at this size, but I want to test it on, for example, 5M rows. Is there a way to duplicate the df 9 times? (It does not matter to me if the df contains duplicates.)
I already use union, but it is really too slow (I think it keeps reading from disk every time):
df = df_
for i in range(9):
    df = df.union(df_)
Do you have an idea of a clean way to do that?
Thanks

You can use explode. It should only read from the raw disk once:
from pyspark.sql.types import *
from pyspark.sql.functions import *
schema = StructType([StructField("f1", StringType()), StructField("f2", StringType())])
data = [("a", "b"), ("c", "d")]
rdd = sc.parallelize(data)
df = sqlContext.createDataFrame(rdd, schema)
# Create an array with as many values as times you want to duplicate the rows
dups_array = [lit(i) for i in range(9)]
duplicated = df.withColumn("duplicate", array(*dups_array)) \
    .withColumn("duplicate", explode("duplicate")) \
    .drop("duplicate")

Related

Is there an easy way to split a large CSV file with multiline entries?

Hi, I have a huge 14 GB CSV file with entries that span multiple lines, and I would like an easy way to split it. By the way, the split command will not work, because it is not aware of how many columns there are in a row and will cut rows in the wrong place.
Using XSV (https://github.com/BurntSushi/xsv) is very simple:
xsv split -s 10000 ./outputdir inputFile.csv
-s 10000 sets the number of records to write into each chunk.
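If you would rather stay in Python, here is a minimal sketch using the standard csv module, which (unlike split) understands quoted fields that span multiple lines; the function name, chunk size, and output prefix are just placeholders:
import csv

def split_csv(path, rows_per_chunk=10000, out_prefix="chunk"):
    # Split the file into chunks of rows_per_chunk records, repeating the header in every chunk
    with open(path, newline="") as src:
        reader = csv.reader(src)
        header = next(reader)
        out, writer, chunk = None, None, 0
        for i, row in enumerate(reader):
            if i % rows_per_chunk == 0:
                if out:
                    out.close()
                out = open("%s_%05d.csv" % (out_prefix, chunk), "w", newline="")
                writer = csv.writer(out)
                writer.writerow(header)
                chunk += 1
            writer.writerow(row)
        if out:
            out.close()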
import os
import pandas as pd
import numpy as np
data_root = r"/home/glauber/Projetos/nlp/"
fname = r"blogset-br.csv.gz"
this_file = os.path.join(data_root, fname)
assert os.path.exists(this_file), this_file
column_names = ['postid', 'blogid', 'published', 'title', 'content', 'authorid', 'author_displayName', 'replies_totalItems', 'tags']
parse_dates = ['published']
df_iterator = pd.read_csv(this_file,
                          skiprows=0,
                          compression='gzip',
                          chunksize=1000000,
                          header=None,
                          names=column_names,
                          parse_dates=parse_dates,
                          index_col=1)
# Write each chunk of one million rows out as its own numbered CSV file
count = 0
for df in df_iterator:
    filename = 'blogset-br-' + str(count) + '.csv'
    df.to_csv(filename)
    count += 1
This is the easiest way I could find.

Create a Single CSV from Multiple JSON Files

Looking for a way to modify the script below to produce a single CSV from multiple JSON files. It should include multiple rows, each row returning values for the same fields but tied to a single JSON file (ROW 1 = JSON 1, ROW 2 = JSON 2, etc.). The following produces a CSV with one row of data.
import pandas as pd
df = pd.read_json("pywu.cache.json")
df = df.loc[["station_id", "observation_time", "weather", "temperature_string", "display_location"],"current_observation"].T
df = df.append(pd.Series([df["display_location"]["latitude"], df["display_location"]["longitude"]], index=["latitude", "longitude"]))
df = df.drop("display_location")
print(df['latitude'], df['longitude'])
df = pd.to_numeric(df, errors="ignore")
pd.DataFrame(df).T.to_csv("CurrentObs.csv", index=False, header=False, sep=",")
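One way to extend this to many files is to wrap the per-file logic in a function and collect one row per JSON; a minimal sketch, assuming all files share the pywu.cache.json layout and that the *.json glob pattern matches them (adjust the pattern to your file names):
import glob
import pandas as pd

def extract_row(path):
    # Same per-file extraction as the script above, returned as one row (a Series)
    df = pd.read_json(path)
    row = df.loc[["station_id", "observation_time", "weather",
                  "temperature_string", "display_location"], "current_observation"]
    coords = pd.Series([row["display_location"]["latitude"],
                        row["display_location"]["longitude"]],
                       index=["latitude", "longitude"])
    row = pd.concat([row.drop("display_location"), coords])  # replaces the older Series.append
    return pd.to_numeric(row, errors="ignore")

files = sorted(glob.glob("*.json"))  # hypothetical pattern
pd.DataFrame([extract_row(f) for f in files]).to_csv("CurrentObs.csv", index=False, header=False, sep=",")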

spark failing to read mysql data and save in hdfs [duplicate]

I have a CSV file in Amazon S3 which is 62 MB in size (114,000 rows). I am converting it into a Spark Dataset and taking the first 500 rows from it. The code is as follows:
DataFrameReader df = new DataFrameReader(spark).format("csv").option("header", true);
Dataset<Row> set=df.load("s3n://"+this.accessId.replace("\"", "")+":"+this.accessToken.replace("\"", "")+"#"+this.bucketName.replace("\"", "")+"/"+this.filePath.replace("\"", "")+"");
set.take(500)
The whole operation takes 20 to 30 sec.
Now I am trying the same thing, but instead of CSV I am using a MySQL table with 119,000 rows. The MySQL server is on Amazon EC2. The code is as follows:
String url ="jdbc:mysql://"+this.hostName+":3306/"+this.dataBaseName+"?user="+this.userName+"&password="+this.password;
SparkSession spark=StartSpark.getSparkSession();
SQLContext sc = spark.sqlContext();
DataFrameReader df = new DataFrameReader(spark).format("csv").option("header", true);
Dataset<Row> set = sc
.read()
.option("url", url)
.option("dbtable", this.tableName)
.option("driver","com.mysql.jdbc.Driver")
.format("jdbc")
.load();
set.take(500);
This takes 5 to 10 minutes. I am running Spark inside the JVM, using the same configuration in both cases.
I could use partitionColumn, numPartitions, etc., but I don't have any numeric column, and another issue is that the schema of the table is unknown to me.
My issue is not how to decrease the required time, as I know that in the ideal case Spark would run on a cluster; what I cannot understand is why there is such a big time difference between the two cases above.
This problem has been covered multiple times on StackOverflow:
How to improve performance for slow Spark jobs using DataFrame and JDBC connection?
spark jdbc df limit... what is it doing?
How to use JDBC source to write and read data in (Py)Spark?
and in external sources:
https://github.com/awesome-spark/spark-gotchas/blob/master/05_spark_sql_and_dataset_api.md#parallelizing-reads
So just to reiterate: by default DataFrameReader.jdbc doesn't distribute data or reads. It uses a single thread on a single executor.
To distribute reads:
use ranges with lowerBound / upperBound:
Properties properties;
Dataset<Row> set = sc
    .read()
    .option("partitionColumn", "foo")
    .option("numPartitions", "3")
    .option("lowerBound", 0)
    .option("upperBound", 30)
    .option("url", url)
    .option("dbtable", this.tableName)
    .option("driver", "com.mysql.jdbc.Driver")
    .format("jdbc")
    .load();
use predicates:
Properties properties = new Properties();
Dataset<Row> set = sc
    .read()
    .jdbc(
        url, this.tableName,
        new String[]{"foo < 10", "foo BETWEEN 10 AND 20", "foo > 20"},
        properties
    );
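Note that the predicates variant does not need a numeric column: any set of WHERE-clause fragments that together cover the table will do, and each fragment becomes one partition. A minimal PySpark sketch of the same idea (the created_at column and the date ranges are purely illustrative):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
predicates = [
    "created_at <  '2017-01-01'",
    "created_at >= '2017-01-01' AND created_at < '2017-07-01'",
    "created_at >= '2017-07-01'",
]
# Each predicate is pushed down as its own query, so the read runs in 3 parallel tasks
df = spark.read.jdbc(
    url="jdbc:mysql://<host>:3306/<db>?user=<user>&password=<password>",  # placeholder URL
    table="your_table_name",
    predicates=predicates,
    properties={"driver": "com.mysql.jdbc.Driver"},
)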
Please follow the steps below
1. Download a copy of the JDBC connector for MySQL. I believe you already have one.
wget http://central.maven.org/maven2/mysql/mysql-connector-java/5.1.38/mysql-connector-java-5.1.38.jar
2. Create a db-properties.flat file in the format below:
jdbcUrl=jdbc:mysql://${jdbcHostname}:${jdbcPort}/${jdbcDatabase}
user=<username>
password=<password>
3. Create an empty table first where you want to load the data.
Invoke spark-shell with the driver class:
spark-shell --driver-class-path <your path to mysql jar>
Then import all the required packages:
import java.io.{File, FileInputStream}
import java.util.Properties
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.{SparkConf, SparkContext}
Initiate a HiveContext or an SQLContext:
val sQLContext = new HiveContext(sc)
import sQLContext.implicits._
import sQLContext.sql
Set some of the properties:
sQLContext.setConf("hive.exec.dynamic.partition", "true")
sQLContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
Load the MySQL db properties from the file:
val dbProperties = new Properties()
dbProperties.load(new FileInputStream(new File("your_path_to/db-properties.flat")))
val jdbcurl = dbProperties.getProperty("jdbcUrl")
Create a query to read the data from your table and pass it to the read method of sqlContext. This is where you can manage your WHERE clause:
val df1 = "(SELECT * FROM your_table_name) as s1"
Pass the jdbcUrl, select query, and db properties to the read method:
val df2 = sQLContext.read.jdbc(jdbcurl, df1, dbProperties)
Write it to your target table:
df2.write.format("orc").partitionBy("your_partition_column_name").mode(SaveMode.Append).saveAsTable("your_target_table_name")

Recursive merging of pandas dataframe from imported csvs

I have seen similar questions being asked and responded to. However, no answer seems to address my specific needs.
The following code, which I took and adapted to suit my needs, successfully imports the files and the relevant columns. However, it appends rows onto the df and does not merge the columns based on keys.
import glob
import pandas as pd
import os
path = r'./csv_weather_data'
all_files = glob.glob(os.path.join(path, "*.csv"))
df = pd.concat(pd.read_csv(f, skiprows=47, skipinitialspace=True, usecols=['Year','Month','Day','Hour','DBT'],) for f in all_files)
Typical data structure is the following:
Year Month Day Hour DBT
1989 1 1 0 7.8
1989 1 1 100 8.6
1989 1 1 200 9.2
I would like to achieve the following:
1. import all CSV files contained in a folder into a pandas df
2. merge the first 4 columns into 1 column of datetime values
3. merge all imported CSVs, using the newly created datetime value as an index, and add the DBT columns to it, with each DBT column taking the name of the imported CSV (it is the Dry Bulb Temperature, DBT, of that weather file)
Any advice?
You should divide the problem into two steps:
First, define your import function. Here you need to build the datetime and set it as the index.
def my_import(f):
    df = pd.read_csv(f, skiprows=47, skipinitialspace=True,
                     usecols=['Year', 'Month', 'Day', 'Hour', 'DBT'])
    # Zero-pad so the string parses with '%Y%m%d%H'; the sample stores hours as HHMM (0, 100, 200, ...), hence // 100
    df.loc[:, 'Date'] = pd.to_datetime(
        df.apply(lambda x: '%04d%02d%02d%02d' % (int(x['Year']), int(x['Month']),
                                                 int(x['Day']), int(x['Hour']) // 100), axis=1),
        format='%Y%m%d%H')
    df.drop(['Year', 'Month', 'Day', 'Hour'], axis=1, inplace=True)
    df = df.set_index('Date')  # set_index returns a new frame, so reassign it
    return df
Then you concatenate by columns (axis = 1)
df = pd.concat({f : my_import(f) for f in all_files}, axis = 1)
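The concat above gives a column MultiIndex of (file path, 'DBT') pairs; if, as asked, each DBT column should carry just the name of its CSV, one way is to flatten the columns afterwards (a sketch that assumes the df produced by the line above):
import os

# Keep only the file name (without extension) as the label of each DBT column
df.columns = [os.path.splitext(os.path.basename(f))[0] for f, _ in df.columns]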

How to read a pandas Series from a CSV file

I have a CSV file formatted as follows:
somefeature,anotherfeature,f3,f4,f5,f6,f7,lastfeature
0,0,0,1,1,2,4,5
And I try to read it as a pandas Series (using pandas daily snapshot for Python 2.7).
I tried the following:
import pandas as pd
types = pd.Series.from_csv('csvfile.txt', index_col=False, header=0)
and:
types = pd.read_csv('csvfile.txt', index_col=False, header=0, squeeze=True)
But both just won't work: the first one gives a random result, and the second just imports a DataFrame without squeezing.
It seems like pandas can only recognize as a Series a CSV formatted as follows:
f1, value
f2, value2
f3, value3
But when the feature keys are in the first row instead of the first column, pandas does not want to squeeze it.
Is there something else I can try? Is this behaviour intended?
Here is the way I've found:
df = pandas.read_csv('csvfile.txt', index_col=False, header=0);
serie = df.ix[0,:]
It seems a bit silly to me, as squeeze should already do this. Is this a bug, or am I missing something?
/EDIT: Best way to do it:
df = pandas.read_csv('csvfile.txt', index_col=False, header=0);
serie = df.transpose()[0] # here we convert the DataFrame into a Serie
This is the most stable way to get a row-oriented CSV line into a pandas Series.
BTW, the squeeze=True argument is useless for now, because as of today (April 2013) it only works with row-oriented CSV files, see the official doc:
http://pandas.pydata.org/pandas-docs/dev/io.html#returning-series
This works. squeeze still works, but it won't work alone; index_col needs to be set to zero, as below:
series = pd.read_csv('csvfile.csv', header = None, index_col = 0, squeeze = True)
In [28]: df = pd.read_csv('csvfile.csv')
In [29]: df.ix[0]
Out[29]:
somefeature 0
anotherfeature 0
f3 0
f4 1
f5 1
f6 2
f7 4
lastfeature 5
Name: 0, dtype: int64
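.ix has since been removed from pandas; on current versions the same first row comes out as a Series with .iloc:
import pandas as pd
df = pd.read_csv('csvfile.csv')
row = df.iloc[0]  # first row as a Series, same result as df.ix[0] above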
ds = pandas.read_csv('csvfile.csv', index_col=False, header=0);
X = ds.iloc[:, :10] #ix deprecated
As the pandas value-selection logic is:
DataFrame -> Series = DataFrame[Column] -> Values = Series[Index]
I suggest:
df=pandas.read_csv("csvfile.csv")
s=df[df.columns[0]]
from pandas import read_csv
series = read_csv('csvfile.csv', header=0, parse_dates=[0], index_col=0, squeeze=True)
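On newer pandas releases the squeeze argument has been removed from read_csv; roughly the equivalent there is to read normally and squeeze the single-column frame yourself:
import pandas as pd
series = pd.read_csv('csvfile.csv', header=0, parse_dates=[0], index_col=0).squeeze("columns")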
Since none of the answers above worked for me, here is another one, recreating the Series manually from the DataFrame.
import io
import pandas as pd
# create example series
series = pd.Series([0, 1, 2], index=["a", "b", "c"])
series.index.name = "idx"
print(series)
print()
# create csv
series_csv = series.to_csv()
print(series_csv)
# read csv
df = pd.read_csv(io.StringIO(series_csv), index_col=0)
indx = df.index
vals = [df.iloc[i, 0] for i in range(len(indx))]
series_again = pd.Series(vals, index=indx)
print(series_again)
Output:
idx
a 0
b 1
c 2
dtype: int64
idx,0
a,0
b,1
c,2
idx
a 0
b 1
c 2
dtype: int64