How to create a doubly-nested dict from CSV? - csv

I am trying to read a Financial CSV with Python, and the data looks like this:
Company1;2018;12345;67890;
Company1;2019;34242;12313;
Company2;2018;12412;32423;
Company3;2017;12314;23554;
...
What I am looking for is a function that gives me the following result after reading this CSV:
Dict2 = {
    Company1: { 2018: { Costs: 123, employes: 1231},
                2019: { Costs: 231, employes: 1321}},
    Company2: { 2019: { Costs: 123, employes: 1231}},
    Company3: { 2019: { Costs: 123, employes: 1231}}
}
I am processing the CSV like this:
file2 = open(pfad_ordner + "\daten\standortdaten\FirmenBilanz.csv", "r")
reader = csv.reader(file2, delimiter=";")
Dict2 = {}
for row in reader:
    Dict2[row[0]] = {"Jahr": row[2], "Ort": row[1], "Mitarbeiter_gewerblich": row[3]}
If I do it this way, Python ignores the rows with the same company name. Or, to put it better, it overwrites the dictionary entry, so only one row gets stored per company key.

The defaultdict class from the collections module can help you out.
You'll create Dict2 to be a dictionary that's meant to store other dictionaries:
Dict2 = defaultdict(dict)
Now, you can supply the company's name as a key, and give that key a value that is your "sub dict" of the year with the other values as a dictionary... all in one statement:
Dict2['Foo, inc.']['2018'] = {'Cost': 23, 'Employees': 9}
Here it is put together:
import csv
from collections import defaultdict

file2 = open(pfad_ordner + "\daten\standortdaten\FirmenBilanz.csv", "r")
reader = csv.reader(file2, delimiter=";")

Dict2 = defaultdict(dict)
for row in reader:
    name = row[0]
    year = row[1]
    # Build the inner dict from the remaining columns; adapt the keys to your data
    Dict2[name][year] = {"Costs": row[2], "Mitarbeiter_gewerblich": row[3]}

Related

How to create multiple DataFrames from a multiple lists in Scala Spark

I'm trying to create multiple DataFrames from the two lists below,
val paths = ListBuffer("s3://abc_xyz_tableA.json",
"s3://def_xyz_tableA.json",
"s3://abc_xyz_tableB.json",
"s3://def_xyz_tableB.json",
"s3://abc_xyz_tableC.json",....)
val tableNames = ListBuffer("tableA","tableB","tableC","tableD",....)
I want to create different DataFrames using the table names, by bringing together all the S3 paths that end with the same table name, since those share a unique schema.
So, for example, if the tables and their related paths are brought together:
"tableADF" will have all the data from the paths "s3://abc_xyz_tableA.json" and "s3://def_xyz_tableA.json", as they have "tableA" in the path
"tableBDF" will have all the data from the paths "s3://abc_xyz_tableB.json" and "s3://def_xyz_tableB.json", as they have "tableB" in the path
and so on; there can be many table names and paths.
I'm trying different approaches but haven't been successful yet.
Any leads in achieving the desired solution will be of great help. Thanks!
Using the input_file_name() function, you can filter on the file names to get a DataFrame for each file or file pattern:
import org.apache.spark.sql.functions._
import spark.implicits._
var df = spark.read.format("json").load("s3://data/*.json")
df = df.withColumn("input_file", input_file_name())

val tableADF = df.filter($"input_file".endsWith("tableA.json"))
val tableBDF = df.filter($"input_file".endsWith("tableB.json"))
If the list of file postfixes is long, you can use something like the below; the code explanation is inline.
import org.apache.spark.sql.functions._

object DFByFileName {

  def main(args: Array[String]): Unit = {

    val spark = Constant.getSparkSess
    import spark.implicits._

    //Load your JSON data
    var df = spark.read.format("json").load("s3://data/*.json")

    //Add a column with the file name
    df = df.withColumn("input_file", input_file_name())

    //Extract the unique file postfixes from the file names into a List
    val fileGroupList = df.select("input_file").map(row => {
      val fileName = row.getString(0)
      val index1 = fileName.lastIndexOf("_")
      val index2 = fileName.lastIndexOf(".")
      fileName.substring(index1 + 1, index2)
    }).distinct().collect()

    //Map each file group name to the DataFrame for that file group
    fileGroupList.map(fileGroupName => {
      df.filter($"input_file".endsWith(s"${fileGroupName}.json"))
      //perform dataframe operations
    })
  }
}
Check the code below. The final result type is
scala.collection.immutable.Map[String,org.apache.spark.sql.DataFrame], e.g. Map(tableBDF -> [...], tableADF -> [...], tableCDF -> [...]), where [...] is your column list.
paths
  .map(path => (s"${path.split("_").last.split("\\.json").head}DF", path)) // parse the file name: extract the table name and path into a tuple
  .groupBy(_._1) // group paths that share the same table name
  .map(p => (p._1 -> p._2.map(_._2))).par // combine the paths of each table into a list; .par executes the subsequent steps in parallel
  .map(mp => {
    (
      mp._1, // table name
      mp._2.par // load multiple files of the same table in parallel
        .map(spark.read.json(_)) // load the files from S3
        .reduce(_ union _) // union if the same table has multiple files
    )
  })
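For a quick sanity check, a hypothetical usage sketch (it assumes the expression above is assigned to a value named dfByTable, a name that is not in the original answer):
// Hypothetical: dfByTable is the Map produced by the expression above.
val tableADF = dfByTable("tableADF")
tableADF.printSchema()
tableADF.show(10, false)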

How to check for specific field values based on some condition while converting csv file to json format

Below is the code to convert a CSV file to JSON format in Python.
I have two fields, 'recommendation' and 'rating'. Based on the recommendation value I need to set the value of the rating field, e.g. if recommendation is 1 then rating = 1, and vice versa. With the answer I got, I'm getting output for only one record entry instead of all the records; I think it's being overridden. Do I need to create a separate list and append each record entry to it to get the output for all records?
Here's the updated code:
def main(input_file):
    csv_rows = []
    with open(input_file, 'r') as csvfile:
        reader = csv.DictReader(csvfile, delimiter='|')
        title = reader.fieldnames
        for row in reader:
            entry = OrderedDict()
            for field in title:
                entry[field] = row[field]
            [c.update({'RATING': c['RECOMMENDATIONS']}) for c in reader]
            csv_rows.append(entry)
    with open(json_file, 'w') as f:
        json.dump(csv_rows, f, sort_keys=True, indent=4, ensure_ascii=False)
        f.write('\n')
I want to create the nested format like the below:
"rating": {
"user_rating": {
"rating": 1
},
"recommended": {
"rating": 1
}
After you've read the file in, using the csv.DictReader, you'll have a list of dicts. Since you want to set the values now, it's a simple dict manipulation. There are several ways, of which one is:
[c.update({'rating': c['recommendation']}) for c in read_csvDictReader]
Hope that helps.

Converting csv RDD to map

I have a large CSV (> 500 MB) which I read into a Spark RDD, and I want to store it in a large Map[String, Array[Long]].
The CSV has multiple columns, but I require only the first two for the time being; it is of the form:
A 12312 [some_value] ....
B 123123 [some_value] ....
A 1222 [some_value] ....
C 1231 [some_value] ....
I want my map to basically group by the string and store an array of longs, so for the above case my map would be:
{"A": [12312, 1222], "B": [123123], "C": [1231]}
But since this map would be huge, I can't simply do this directly.
I read the CSV into a sql.DataFrame.
My code so far (it looks incorrect, though):
def getMap(df: sql.DataFrame, sc: SparkContext): RDD[Map[String, Array[Long]]] = {
  var records = sc.emptyRDD[Map[String, Array[Long]]]
  val rows: RDD[Row] = df.rdd
  rows.foreachPartition( iter => {
    iter.foreach(x =>
      if (records.contains(x.get(0).toString)) {
        val arr = temp_map.getOrElse()
        records = records + (x.get(0).toString -> (temp_map.getOrElse(x.get(0).toString) :+ x.get(1).toString.toLong))
      }
      else {
        val arr = new Array[Long](1)
        arr(0) = x.get(1).toString.toLong
        records = records + (x.get(0).toString -> arr)
      }
    )
  })
}
Thanks in advance!
If I understood your question correctly, then you could groupBy the first column and collect_list the second column:
import org.apache.spark.sql.functions._

val newDF = df.groupBy("column1").agg(collect_list("column2"))
newDF.show(false)

val rdd = newDF.rdd.map(r => (r.getString(0), r.getAs[Seq[Long]](1)))
This will give you an RDD[(String, Seq[Long])] where the string (the first column) is unique.
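If the grouped result fits in driver memory, a minimal sketch (assuming the second column really holds Long values) of turning it into the Map[String, Array[Long]] the question asks for:
// Collect the grouped pair RDD to the driver as a Map[String, Array[Long]].
// This is only safe when the aggregated data fits in driver memory.
val grouped: Map[String, Array[Long]] = rdd
  .mapValues(_.toArray)
  .collectAsMap()
  .toMap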

Retrieve data by Ignoring null values and header row from csv file

I am working on a Groovy script in SoapUI 5.3.0 and facing the below issue while extracting values from a file into a list.
The purpose of the code below is that the retrieved list has to be compared with another list that contains valid values only.
Attaching the code snippet and the sample CSV file for reference.
Code to retrieve the values:
def DBvalue = context["csvfile"]   //csv file containing the data
def count = context["dbrowcount"]  //here the rowcount is 23
for (i=0; i<count; i++) {
    def lines = ""
    lines = DBvalue.text.split('\n')
    List<String> rows = lines.collect{ it.split(';') }
    log.info "list is" + rows
}
The sample CSV file I am working on contains 600 columns of data and 23 rows:
abc;null;1;2;3;5;8;null
cdf;null;2;3;6;null;5;6
hgf;null;null;null;jr;null;II
Currently my code is fetching the below output:
[[abc,null,1,2,3,5,8,null]]
[[abc,null,1,2,3,5,8,null]]
[[abc,null,1,2,3,5,8,null]]
Desired output:
[1,2,3,5,8]
[2,3,6,5,6]
[jr,II]
You should be able to achieve it with the code below; follow the in-line comments.
//Provide your file path; change if needed
def file = new File('/tmp/test.csv')

//To hold all the rows
def list = []

//Change delimiter if needed
def delimiter = ';'

file.readLines().eachWithIndex { line, index ->
    //index 0 is the header row; skip it
    if (index) {
        //Get the row data by split, and filter out 'null' and empty values
        def lineData = line.split(delimiter).findAll { 'null' != it && it }
        log.info lineData
        list << lineData
    }
}

//Print all the row data
log.info list

Spark RDD to CSV - Add empty columns

I have an RDD[Map[String,Int]] where the keys of the maps are the column names. Each map is incomplete, and to know all the column names I would need to union all the keys. Is there a way to avoid this collect operation, so that I can get the CSV with a single rdd.saveAsTextFile(..) call?
For example, say I have an RDD with two elements (scala notation):
Map("a"->1, "b"->2)
Map("b"->1, "c"->3)
I would like to end up with this csv:
a,b,c
1,2,0
0,1,3
Scala solutions are better but any other Spark-compatible language would do.
EDIT:
I can try to solve my problem from another direction also. Let's say I somehow know all the columns in the beginning, but I want to get rid of columns that have 0 value in all maps. So the problem becomes, I know that the keys are ("a", "b", "c") and from this:
Map("a"->1, "b"->2, "c"->0)
Map("a"->3, "b"->1, "c"->0)
I need to write the csv:
a,b
1,2
3,1
Would it be possible to do this with only one collect?
If your statement is: "every new element in my RDD may add a new column name I have not seen so far", the answer is that you obviously can't avoid a full scan. But you don't need to collect all elements on the driver.
You could use aggregate to only collect column names. This method takes two functions, one is to insert a single element into the resulting collection, and another one to merge results from two different partitions.
rdd.aggregate(Set.empty[String])( {(s, m) => s union m.keySet }, { (s1, s2) => s1 union s2 })
You will get back a set of all column names in the RDD. In a second scan you can print the CSV file.
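A minimal sketch of that second pass, assuming rdd: RDD[Map[String, Int]] (the column set is recomputed here so the snippet is self-contained):
// Fix a column order, then emit one CSV line per element, filling missing keys with 0.
val cols = rdd
  .aggregate(Set.empty[String])((s, m) => s union m.keySet, _ union _)
  .toSeq.sorted

val csvLines = rdd.map(m => cols.map(c => m.getOrElse(c, 0)).mkString(","))

// Write the data rows; the header line (cols.mkString(",")) can be written separately.
csvLines.saveAsTextFile("mycsv")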
Scala and any other supported language
You can use spark-csv
First, let's find all the columns present:
val cols = sc.broadcast(rdd.flatMap(_.keys).distinct().collect())
Create RDD[Row]:
import org.apache.spark.sql.Row

val rows = rdd.map { row =>
  Row.fromSeq(cols.value.map { row.getOrElse(_, 0) })
}
Prepare schema:
import org.apache.spark.sql.types.{StructType, StructField, IntegerType}

val schema = StructType(
  cols.value.map(field => StructField(field, IntegerType, true)))
Convert RDD[Row] to Data Frame:
val df = sqlContext.createDataFrame(rows, schema)
Write results:
// Spark 1.4+, for other versions see spark-csv docs
df.write.format("com.databricks.spark.csv").save("mycsv.csv")
You can do pretty much the same thing using other supported languages.
Python
If you use Python and the final data fits in driver memory, you can use Pandas through the toPandas() method:
rdd = sc.parallelize([{'a': 1, 'b': 2}, {'b': 1, 'c': 3}])
cols = sc.broadcast(rdd.flatMap(lambda row: row.keys()).distinct().collect())

df = sqlContext.createDataFrame(
    rdd.map(lambda row: {k: row.get(k, 0) for k in cols.value}))

df.toPandas().to_csv('mycsv.csv', index=False)
or directly:
import pandas as pd
pd.DataFrame(rdd.collect()).fillna(0).to_csv('mycsv.csv', index=False)
Edit
One possible way to avoid the second collect is to use accumulators, either to build a set of all column names or to count the columns where you found zeros, and then use this information to map over the rows and remove the unnecessary columns or add the zeros.
It is possible but inefficient and feels like cheating. The only situation where it makes some sense is when the number of zeros is very low, but I guess that is not the case here.
object ColsSetParam extends AccumulatorParam[Set[String]] {
  def zero(initialValue: Set[String]): Set[String] = {
    Set.empty[String]
  }
  def addInPlace(s1: Set[String], s2: Set[String]): Set[String] = {
    s1 ++ s2
  }
}

val colSetAccum = sc.accumulator(Set.empty[String])(ColsSetParam)

rdd.foreach { colSetAccum += _.keys.toSet }
or
// We assume you know this upfront
val allColnames = sc.broadcast(Set("a", "b", "c"))

object ZeroColsParam extends AccumulatorParam[Map[String, Int]] {
  def zero(initialValue: Map[String, Int]): Map[String, Int] = {
    Map.empty[String, Int]
  }
  def addInPlace(m1: Map[String, Int], m2: Map[String, Int]): Map[String, Int] = {
    val keys = m1.keys ++ m2.keys
    keys.map(
      (k: String) => (k -> (m1.getOrElse(k, 0) + m2.getOrElse(k, 0)))).toMap
  }
}

val accum = sc.accumulator(Map.empty[String, Int])(ZeroColsParam)
rdd.foreach { row =>
// If allColnames.value -- row.keys.toSet is empty we can avoid this part
accum += (allColnames.value -- row.keys.toSet).map(x => (x -> 1)).toMap
}