PySpark write CSV quote all non-numeric - csv

Is there a way to quote only non-numeric columns in the dataframe when output to CSV file using df.write.csv('path')?
I know you can use the option quoteAll=True to quote all the columns but I only want to quote the string columns.
I am using PySpark 2.2.0.

I only want to quote the string columns.
There is currently no parameter in write.csv that you can use to specify which columns to quote. However, one workaround is to modify your string columns by adding quotes around the values.
First, identify the string columns by iterating over df.dtypes:
string_cols = [c for c, t in df.dtypes if t == "string"]
Now you can modify these columns by adding a quote as a prefix and suffix:
from pyspark.sql.functions import col, lit, concat
cols = [
    concat(lit('"'), col(c), lit('"')) if c in string_cols else col(c)
    for c in df.columns
]
df = df.select(*cols)
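Finally, write out the CSV. Spark's writer may quote and escape values that already contain the quote character, so inspect the output and adjust the quote or escape options if the added quotes come out escaped: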
df.write.csv('path')
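If the data is small enough to collect to the driver, a different route is Python's standard csv module, whose QUOTE_NONNUMERIC mode quotes every non-numeric field. This is not a Spark API. The sketch below assumes the rows are already plain Python tuples (for example from df.collect()); the rows_to_csv helper and the sample rows are hypothetical.

```python
import csv
import io


def rows_to_csv(header, rows):
    """Serialize rows to CSV text, quoting only non-numeric fields.

    Hypothetical helper: `rows` stands in for data collected from a
    DataFrame, which is only sensible for small results.
    """
    buf = io.StringIO()
    writer = csv.writer(buf, quoting=csv.QUOTE_NONNUMERIC, lineterminator="\n")
    writer.writerow(header)   # header names are strings, so they get quoted
    writer.writerows(rows)    # ints and floats are written without quotes
    return buf.getvalue()


csv_text = rows_to_csv(["name", "id"], [("foo", 1), ("bar, baz", 2)])
print(csv_text)
```

Numbers stay unquoted only if they are real int or float values; numeric strings would be quoted like any other string.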

Related

JSON- Regex to identify a pattern in JSON

I'm new to Python3 and I am working with large JSON objects. I have a large JSON object which has extra chars coming in between two JSON objects, in between the braces.
For example:
{"id":"121324343", "name":"foobar"}3$£_$£rvcfddkgga£($(>..bu&^783 { "id":"343554353", "name":"ABCXYZ"}'
These extra chars could be anything alphanumeric, special chars or ASCII. They appear in this large JSON multiple times and can be of any length. I'm trying to use regex to identify that pattern to remove them, but regex doesn't seem to work. Here is the regex I used:
(^}\n[a-zA-Z0-9]+{$)
Is there a way of identifying such a pattern using regex in Python?
You can select the dictionary data based on named capture groups. As a bonus, this will also ignore any { or } within the extra chars.
The following pattern works on the provided data:
r'"id":"(?P<id>\d+?)"[,\s]+"name":"(?P<name>[ \w]+)"'
Example
import re
from pprint import pprint
string = """
{"id":"121324343", "name":"foobar"}3$£_$£rvcfdd{}kgga£($(>..bu&^783 { "id":"343554353", "name":"ABC XYZ"}'
"""
# Raw string so that \d, \s and \w reach the regex engine unchanged.
pattern = re.compile(r'"id":"(?P<id>\d+?)"[,\s]+"name":"(?P<name>[ \w]+)"')
pprint([match.groupdict() for match in pattern.finditer(string)])
Output
[{'id': '121324343', 'name': 'foobar'}, {'id': '343554353', 'name': 'ABC XYZ'}]
Test it out yourself: https://regex101.com/r/82BqbE/1
Notes
For this example I assume the following:
id only contains integer digits.
name is a string that can contain the following characters [a-zA-Z0-9_ ]. (this includes white spaces and underscores).
Assuming the whole JSON is on a single line and there are no { or } characters inside the fields themselves, this should be enough:
In [1]: import re
In [2]: x = """{"id":"121324343", "name":"foobar"}3$£_$£rvcfddkgga£($(>..bu&^783 { "id":"343554353", "name":"ABCXYZ"}"""
In [3]: print(re.sub(r'(?<=})[^}{]+(?={)', "\n", x))
{"id":"121324343", "name":"foobar"}
{ "id":"343554353", "name":"ABCXYZ"}
You can check the regex here https://regex101.com/r/leIoqE/1
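Once the junk is replaced by newlines, each line can be handed to the json module. A minimal sketch under the same assumptions as above (no braces inside field values, and at least one junk character between objects); split_records is a hypothetical helper name.

```python
import json
import re


def split_records(text):
    """Replace the junk between '}' and '{' with newlines, then parse each line."""
    cleaned = re.sub(r'(?<=})[^}{]+(?={)', "\n", text)
    # Skip empty lines so leading or trailing whitespace does not break parsing.
    return [json.loads(line) for line in cleaned.splitlines() if line.strip()]


raw = '{"id":"121324343", "name":"foobar"}3$£_$£rvcfddkgga£($(>..bu&^783 { "id":"343554353", "name":"ABCXYZ"}'
records = split_records(raw)
print(records)
```

Parsing with json.loads also acts as a check: if the cleanup left anything that is not valid JSON, it raises an error instead of silently producing bad data.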

Get JuliaDB.loadtable() to parse all columns as String

I want JuliaDB.loadtable() to read a CSV (really a bunch of CSVs, but for simplicity let's try just one), where all columns are parsed as String.
Here's what I've tried:
using CSV
using DataFrames
using JuliaDB
df1 = DataFrame(
    [['a', 'b', 'c'], [1, 2, 3]],
    ["name", "id"]
)
CSV.write("df1.csv", df1)
# This works, but if I have 10+ columns it would get unwieldy
df1 = loadtable("df1.csv"; colparsers=Dict(:name=>String, :id=>String),)
# This doesn't work
df1 = loadtable("df1.csv"; colparsers=String,)
# MethodError: no method matching iterate(::Type{String})
Here's how it's done in R:
df1 = read.csv("df1.csv", colClasses = "character")
If you know the number of columns (or just an upper bound on it), you can use types, I should think (from CSV.jl documentation):
types: a Vector or Dict of types to be used for column types; a Dict can map column index Int, or name Symbol or String to type for a column, i.e. Dict(1=>Float64) will set the first column as a Float64, Dict(:column1=>Float64) will set the column named column1 to Float64 and, Dict("column1"=>Float64) will set the column1 to Float64; if a Vector is provided, it must match the # of columns provided or detected in header

Logstash - Substring from CSV column

I want to import a lot of information from a CSV file into Elasticsearch.
My issue is that I don't know how to use an equivalent of substring to select part of a CSV column.
In my case I have a date field (YYYYMMDD) and I want to have (YYYY-MM-DD).
I use filter, mutate and gsub like this:
filter
{
  mutate
  {
    gsub => ["date", "[0123456789][0123456789][0123456789][0123456789][0123456789][0123456789][0123456789][0123456789]", "[0123456789][0123456789][0123456789][0123456789]-[0123456789][0123456789]-[0123456789][0123456789]"]
  }
}
But the result is wrong.
I can identify my string, but I don't know how to extract parts of it.
My goal is to have something like:
gsub => ["date", "[0123456789][0123456789][0123456789][0123456789][0123456789][0123456789][0123456789][0123456789]", "%{date}(0..3)-%{date}(4..5)-%{date}(6..7)"]
%{date}(0..3): select the first 4 characters of the CSV column date
You can use the ruby filter plugin to do the conversion. As you say, you already have a date field, so we can use it directly in Ruby:
filter {
  ruby {
    code => "
      date = Time.strptime(event['date'],'%Y%m%d')
      event['date_new'] = date.strftime('%Y-%m-%d')
    "
  }
}
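The date_new field is in the format you want. Note that event['date'] is the older event API; Logstash 5.0 and later use event.get('date') and event.set('date_new', ...) instead.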
First, you can use a regexp range to match a sequence, so rather than [0123456789], you can do [0-9]. If you know there will be 4 numbers, you can do [0-9]{4}.
Second, you want to "capture" parts of your input string and reorder them in the output. For that, you need capture groups:
([0-9]{4})([0-9]{2})([0-9]{2})
where parens define the groups. Then you can reference those on the right side of your gsub:
\1-\2-\3
\1 is the first capture group, etc.
You might also consider getting these three fields when you do the grok{}, and then putting them together again later (perhaps with add_field).
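The capture-group rewrite can be tried outside Logstash first. A minimal sketch in Python, whose re module uses the same \1, \2, \3 backreference notation in the replacement string; add_dashes is a hypothetical helper name and the sample date is made up.

```python
import re

# Three capture groups: year (4 digits), month (2 digits), day (2 digits).
DATE_PATTERN = re.compile(r'([0-9]{4})([0-9]{2})([0-9]{2})')


def add_dashes(value):
    """Rewrite YYYYMMDD as YYYY-MM-DD by reordering the captured groups."""
    return DATE_PATTERN.sub(r'\1-\2-\3', value)


print(add_dashes("20170315"))
```

A value that does not contain eight consecutive digits is returned unchanged, because the pattern simply does not match.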

Python csv writer, numpy array to csv

I have a Python dict containing 4 key-value pairs. Each value is a numpy array. Now I want to write the whole dict to a CSV file, forcing one numpy array per row.
import csv
import os
with open(os.path.join("csv", title), 'w', newline='') as f:
    w = csv.DictWriter(f, list(data.keys()))
    w.writeheader()
    w.writerow(data)
This is what I have used so far, but some of my arrays get written across several rows instead of a single line.
Here is an example of the input data:
{'DE': array([[ 38574. , 38538.1904, 39511.6190, 42521.1428,
50586. , 46282.5238, 42714.4761, 40612.0476],
[ 42798.4666, 42112.5333, 42277.8666, 42886.1333,
50224.3333, 48148.8 , 44272.6666, 41210.2 ]])}
I expect each row of my array to be written on one line. Instead I get a file containing "\n" after a certain number of digits. How can I force the whole array to be written in one row?
DE has a multidimensional array as its value and Inter has an empty list as its value, so you end up with two columns: one with Inter as the header and an empty list in its column, and a second with DE as the header and the array in its column, which is exactly what the code should be doing.
If you want to avoid the line breaks inside each array's text representation, try raising the line width with numpy.set_printoptions:
numpy.set_printoptions(linewidth=1000)
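A different approach that does not depend on print options is to write the numbers themselves rather than the array's string form. This sketch assumes each array has first been converted to nested lists (for example with the array's tolist() method), so only the standard library is needed. The write_rows helper and the shortened sample values are hypothetical.

```python
import csv
import io


def write_rows(data, out):
    """Write one CSV line per array row, prefixed with the dict key."""
    writer = csv.writer(out, lineterminator="\n")
    for key, rows in data.items():
        for row in rows:
            # Each number becomes its own CSV field, so no embedded newlines.
            writer.writerow([key, *row])


# Stand-in for {'DE': array([[...], [...]])} after calling .tolist() on the value.
data = {"DE": [[38574.0, 38538.1904], [42798.4666, 42112.5333]]}
buf = io.StringIO()
write_rows(data, buf)
print(buf.getvalue())
```

This changes the layout from one column per key to one line per array row, which may or may not suit whatever reads the file afterwards.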

Reading a string with commas gets automatically wrapped with quotes

I am parsing a tab delimited file and some of the columns contain commas in them.
Each column with a comma is getting wrapped in quotes, which breaks some code later on that tries to search for those values (without the quotes). For example, if the original value in the CSV file was "A, B, C", it is now stored as ""A, B, C"".
How can those extra quotes be removed/escaped automatically?
Thanks
The code I am using to read the file is:
genresMap = [:]
new File(runtime_dir + 'file.csv').eachLine { def line ->
    column = line.split("\t");
    genresMap[column[0].toString()] = column[1];
}