Is there a way to programmatically set a dataset's schema from a .csv - palantir-foundry

As an example, I have a .csv that uses the Excel dialect, which escapes quotes by doubling them (like the doublequote behaviour of Python's csv module).
For example, consider the row below:
"XX ""YYYYYYYY"", ZZZZZZ ""QQQQQQ""","JJJJ ""MMMM"", RRRR ""TTTT""",1234,RRRR,60,50
I would want the schema to then become:
[
'XX "YYYYYYYY", ZZZZZZ "QQQQQQ"',
'JJJJ "MMMM", RRRR "TTTT"',
1234,
'RRRR',
60,
50
]
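For reference, Python's csv module parses that row into exactly these values with its default Excel/doublequote handling (everything comes back as a string before any type inference):
import csv
import io

row = '"XX ""YYYYYYYY"", ZZZZZZ ""QQQQQQ""","JJJJ ""MMMM"", RRRR ""TTTT""",1234,RRRR,60,50'
print(next(csv.reader(io.StringIO(row))))
# ['XX "YYYYYYYY", ZZZZZZ "QQQQQQ"', 'JJJJ "MMMM", RRRR "TTTT"', '1234', 'RRRR', '60', '50']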
Is there a way to set the schema of a dataset in a programmatic/automated fashion?

While you can do this in code, Foundry's dataset app can also do it natively. This means you can skip writing the code (which is nice), and it can also save a step in your pipeline (which might save you some runtime).
After uploading the files to a dataset, press "edit schema" on the dataset.
Then apply parsing settings that match the file (a header row, with the quote character and the escape character both set to "), which would result in the desired outcome in your case.
Then press "save and validate" and the dataset should end up with the correct schema.

Starting with this example:
Dataset<Row> dataset = files
    .sparkSession()
    .read()
    .option("inferSchema", "true")
    .csv(csvDataset);

output.getDataFrameWriter(dataset).write();
Add the header, quote, and escape options, like so:
Dataset<Row> dataset = files
    .sparkSession()
    .read()
    .option("inferSchema", "true")
    .option("header", "true")
    .option("quote", "\"")
    .option("escape", "\"")
    .csv(csvDataset);

output.getDataFrameWriter(dataset).write();
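For reference, the same options work in plain PySpark as well; a minimal sketch (the session setup and path are placeholders, not Foundry-specific API):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Use '"' as both the quote and the escape character so that doubled
# quotes ("") inside a quoted field collapse to a single literal quote.
df = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .option("quote", '"')
    .option("escape", '"')
    .csv("/path/to/raw_csv")  # placeholder path
)
df.printSchema()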

Related

Add a new line in front of each line before writing to JSON format using Spark in Scala

I'd like to add one new line in front of each of my JSON documents before Spark writes them into my S3 bucket:
df.createOrReplaceTempView("ParquetTable")
val parkSQL = spark.sql("select LAST_MODIFIED_BY, LAST_MODIFIED_DATE, NVL(CLASS_NAME, className) as CLASS_NAME, DECISION, TASK_TYPE_ID from ParquetTable")
parkSQL.show(false)
parkSQL.count()
parkSQL.write.json("s3://test-bucket/json-output-7/")
With only this command, it'll produce files with the contents below:
{"LAST_MODIFIED_BY":"david","LAST_MODIFIED_DATE":"2018-06-26 12:02:03.0","CLASS_NAME":"/SC/Trade/HTS_CA/1234abcd","DECISION":"AGREE","TASK_TYPE_ID":"abcd1234-832b-43b6-afa6-361253ffe1d5"}
{"LAST_MODIFIED_BY":"sarah","LAST_MODIFIED_DATE":"2018-08-26 12:02:03.0","CLASS_NAME":"/SC/Import/HTS_US/9876abcd","DECISION":"DISAGREE","TASK_TYPE_ID":"abcd1234-832b-43b6-afa6-361253ffe1d5"}
But what I'd like to achieve is something like the below:
{"index":{}}
{"LAST_MODIFIED_BY":"david","LAST_MODIFIED_DATE":"2018-06-26 12:02:03.0","CLASS_NAME":"/SC/Trade/HTS_CA/1234abcd","DECISION":"AGREE","TASK_TYPE_ID":"abcd1234-832b-43b6-afa6-361253ffe1d5"}
{"index":{}}
{"LAST_MODIFIED_BY":"sarah","LAST_MODIFIED_DATE":"2018-08-26 12:02:03.0","CLASS_NAME":"/SC/Import/HTS_US/9876abcd","DECISION":"DISAGREE","TASK_TYPE_ID":"abcd1234-832b-43b6-afa6-361253ffe1d5"}
Any insight on how to achieve this result would be greatly appreciated!
The code below converts each row of the DataFrame to JSON, concatenates {"index":{}} with that JSON data, and then saves the result using the text format.
import org.apache.spark.sql.functions.{concat_ws, lit, struct, to_json}
import spark.implicits._ // for the $"" column syntax

df
  .select(
    lit("""{"index":{}}""").as("index"),
    to_json(struct($"*")).as("json_data")
  )
  .select(
    concat_ws(
      "\n", // puts the index marker and the row's JSON on separate lines
      $"index",
      $"json_data"
    ).as("data")
  )
  .write
  .format("text") // required: the text source writes the single string column as-is
  .save("s3://test-bucket/json-output-7/")
Final Output
cat part-00000-24619b28-6501-4763-b3de-1a2f72a5a4ec-c000.txt
{"index":{}}
{"CLASS_NAME":"/SC/Trade/HTS_CA/1234abcd","DECISION":"AGREE","LAST_MODIFIED_BY":"david","LAST_MODIFIED_DATE":"2018-06-26 12:02:03.0","TASK_TYPE_ID":"abcd1234-832b-43b6-afa6-361253ffe1d5"}
{"index":{}}
{"CLASS_NAME":"/SC/Import/HTS_US/9876abcd","DECISION":"DISAGREE","LAST_MODIFIED_BY":"sarah","LAST_MODIFIED_DATE":"2018-08-26 12:02:03.0","TASK_TYPE_ID":"abcd1234-832b-43b6-afa6-361253ffe1d5"}
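If the pipeline is written in PySpark rather than Scala, an equivalent sketch looks like this (df is assumed to be the DataFrame from the question; the output path is the one used above):
from pyspark.sql.functions import concat_ws, lit, struct, to_json

(
    df.select(
        lit('{"index":{}}').alias("index"),
        to_json(struct(*df.columns)).alias("json_data"),
    )
    # Put the index marker and the row's JSON on separate lines.
    .select(concat_ws("\n", "index", "json_data").alias("data"))
    .write
    .format("text")  # the text format is required here as well
    .save("s3://test-bucket/json-output-7/")
)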

jq - How to extract domains and remove duplicates

Given the following json:
Full file here: https://pastebin.com/Hzt9bq2a
{
"name": "Visma Public",
"domains": [
"accountsettings.connect.identity.stagaws.visma.com",
"admin.stage.vismaonline.com",
"api.home.stag.visma.com",
"api.workbox.dk",
"app.workbox.dk",
"app.workbox.co.uk",
"authz.workbox.dk",
"connect.identity.stagaws.visma.com",
"eaccounting.stage.vismaonline.com",
"eaccountingprinting.stage.vismaonline.com",
"http://myservices-api.stage.vismaonline.com/",
"identity.stage.vismaonline.com",
"myservices.stage.vismaonline.com"
]
}
How can I transform the data into the below? That is, identify the domains present (in the site.SLD.TLD sense) and remove the duplicates, not including the subdomains, protocols, or paths, as illustrated below.
{
"name": "Visma Public",
"domains": [
"workbox.co.uk",
"workbox.dk",
"visma.com",
"vismaonline.com"
]
}
I would like to do so in jq, as that is what I've used to wrangle the data into this format so far, but at this stage any solution I can run on Debian (I'm using bash), ideally without any extraneous tooling, would be fine.
I'm aware that regex can be used within jq, so I assume the best way is to regex out the domain and then pipe to unique; however, I'm unable to get anything working so far. I'm currently trying this version, which seems to need only the text-transformation stage added in somehow, either during the jq process or with a pass over the output with something like awk afterwards:
jq '[.[] | {name: .name, domain: [.domains[]] | unique}]' testfile.json
This appears to be useful: https://github.com/stedolan/jq/issues/537
One solution was offered there which does a regex match to extract the last two strings separated by "." and calls the unique function on that. It works up to a point, but doesn't cover site.SLD.TLD where the suffix has two parts: google.co.uk, for example, would return only co.uk with this jq:
jq '.domains |= (map(capture("(?<x>[[:alpha:]]+).(?<z>[[:alpha:]]+)(.?)$") | join(".")) | unique)'
A programming language is much more expressive than jq.
Try the following snippet with python3.
import json
import pprint
import urllib.request
from urllib.parse import urlparse
import os


def get_tlds():
    f = urllib.request.urlopen("https://publicsuffix.org/list/effective_tld_names.dat")
    content = f.read()
    lines = content.decode('utf-8').split("\n")
    # remove comments
    tlds = [line for line in lines if not line.startswith("//") and not line == ""]
    return tlds


def extract_domain(url, tlds):
    # get domain
    url = url.replace("http://", "").replace("https://", "")
    url = url.split("/")[0]
    # get tld/sld
    parts = url.split(".")
    suffix1 = parts[-1]
    sld1 = parts[-2]
    if len(parts) > 2:
        suffix2 = ".".join(parts[-2:])
        sld2 = parts[-3]
    else:
        suffix2 = suffix1
        sld2 = sld1
    # try the longer suffix first
    if suffix2 in tlds:
        tld = suffix2
        sld = sld2
    else:
        tld = suffix1
        sld = sld1
    return sld + "." + tld


def clean(site, tlds):
    site["domains"] = list(set([extract_domain(url, tlds) for url in site["domains"]]))
    return site


if __name__ == "__main__":
    filename = "Hzt9bq2a.json"
    cache_path = "tlds.json"
    if os.path.exists(cache_path):
        with open(cache_path, "r") as f:
            tlds = json.load(f)
    else:
        tlds = get_tlds()
        with open(cache_path, "w") as f:
            json.dump(tlds, f)
    with open(filename) as f:
        d = json.load(f)
    d = [clean(site, tlds) for site in d]
    pprint.pprint(d)
    with open("clean.json", "w") as f:
        json.dump(d, f)
May I offer you a way of achieving the same query with jtc? The same could be achieved in other languages (and of course in jq); the query is mostly a matter of coming up with a regex that satisfies your ask:
bash $ <file.json jtc -w'<domains>l:>((?:[a-z0-9]+\.)?[a-z0-9]+\.[a-z0-9]+)[^.]*$<R:' -u'{{$1}}' /\
-ppw'<domains>l:><q:' -w'[domains]:<[]>j:' -w'<name>l:'
{
"domains": [
"stagaws.visma.com",
"stage.vismaonline.com",
"stag.visma.com",
"api.workbox.dk",
"app.workbox.dk",
"workbox.co.uk",
"authz.workbox.dk"
],
"name": "Visma Public"
}
bash $
Note: it does extract only DOMAIN.TLD, as per your ask. If you'd like to extract DOMAIN.SLD.TLD, then the task becomes a bit less trivial.
Update:
Modified the solution as per the comment: extract domain.sld.tld where there are 3 or more levels, and domain.tld where there are only 2.
PS. I'm the creator of jtc, a JSON processing utility. This disclaimer is an SO requirement.
One of the solutions presented on this page offers that:
A programming language is much more expressive than jq.
It may therefore be worthwhile pointing out that jq is an expressive, Turing-complete programming language, and that it would be as straightforward (and as tedious) to capture all the intricacies of the "Public Suffix List" using jq as any other programming language that does not already provide support for this list.
It may be useful to illustrate an approach to the problem that passes the (revised) test presented in the Q. This approach could easily be extended in any one of a number of ways:
def extract:
    sub("^[^:]*://";"")
  | sub("/.*$";"")
  | split(".")
  | (if (.[-1]|length) == 2 and (.[-2]|length) <= 3
     then -3 else -2 end) as $ix
  | .[$ix : ]
  | join(".") ;
{name, domain: (.domains | map(extract) | unique)}
Output
{
"name": "Visma Public",
"domain": [
"visma.com",
"vismaonline.com",
"workbox.co.uk",
"workbox.dk"
]
}
Judging from your example, you don't actually want top-level domains (just one component, e.g. ".com"), and you probably don't really want second-level domains (last two components) either, because some domain registries don't operate at the TLD level. Given www.foo.com.br, you presumably want to find out about foo.com.br, not com.br.
To do that, you need to consult the Public Suffix List. The file format isn't too complicated, but it has support for wildcards and exceptions. I dare say that jq isn't the ideal language to use here — pick one that has a URL-parsing module (for extracting hostnames) and an existing Public Suffix List module (for extracting the domain parts from those hostnames).
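For instance, in Python the tldextract package wraps the Public Suffix List and handles the URL parsing; a rough sketch (this does require installing a third-party package, which goes beyond the "no extraneous tooling" preference):
# pip install tldextract  -- third-party, bundles the Public Suffix List
import json
import tldextract

with open("testfile.json") as f:
    sites = json.load(f)  # assuming the full pastebin file is a list of objects

for site in sites:
    # registered_domain gives e.g. "workbox.co.uk" for "app.workbox.co.uk"
    site["domains"] = sorted({tldextract.extract(d).registered_domain for d in site["domains"]})

print(json.dumps(sites, indent=2))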

CSV Reader works, Trouble with CSV writer

I am writing a very simple python script to READ a CSV (no problem) and to write to another CSV (issue):
System info:
Windows 10
Powershell
Python 3.6.5 :: Anaconda, Inc.
Sample Data: Office Events
The purpose is to filter events based on criteria, and to write to another CSV with desired criteria.
For Example:
I would like to read from this CSV and write out the events where Registrations (column 4) is greater than 0 (i.e. remove the rows where registrations = 0).
# SCRIPT TO FILTER EVENTS TO BE PROCESSED
import os
import time
import shutil
import os.path
import fnmatch
import csv
import glob
import pandas

# Location of file containing ALL events
path = r'allEvents.csv'

# Writes to writer
writer = csv.writer(open(r'RegisteredEvents' + time.strftime("%m_%d_%Y-%I_%M_%S") + '.csv', "wb"))
writer.writerow(["Event Name", "Start Date", "End Date", "Registrations", "Total Revenue", "ID", "Status"])
#writer.writerow([r'Event Name', r'Start Date', r'End Date', r'Registrations', r'Total Revenue', r'ID', r'Status'])
#writer.writerow([b'Event Name', b'Start Date', b'End Date', b'Registrations', b'Total Revenue', b'ID', b'Status'])


def checkRegistrations(file):
    reader = csv.reader(file)
    data = list(reader)
    for row in data:
        #if row[3] > str(0):
        if row[3] > int(0):
            writer.writerow(([data]))
The Error I continue to get is:
writer.writerow(["Event Name", "Start Date", "End Date", "Registrations", "Total Revenue", "ID", "Status"])
TypeError: a bytes-like object is required, not 'str'
I have tried using the various commented-out statements.
For example:
"" vs r"" vs r'' vs b''
if row[3] > int(0) vs if row[3] > str(0)
Every time I execute my script, it creates the file, so the first csv writer line works (creating and opening the file); the second line (writing the headers) is when the error appears.
Perhaps I am getting mixed up with syntax due to Python versions, or perhaps I am misusing the CSV library, or (more than likely) I have an endless amount to learn about data type IO and conversion... someone please help!
I am aware of the excess of imported libraries; the script came from another basic script that moved files from one location to another based on filename and output a row counter for each file being moved.
With that being said, I may be unaware of any missing/needed libraries.
Please let me know if you have any questions, concerns or clarifications
Thanks in advance!
It looks like you are calling:
writer = csv.writer(open('file.csv', 'wb'))
The 'wb' argument is the file mode. The 'b' means that you are opening the file that you are writing to in binary mode. You are then trying to write a string, which isn't what it is expecting.
Try getting rid of the 'b' in the 'wb'.
writer = csv.writer(open('file.csv', 'w'))
Let me know if that works for you.
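For example, a corrected sketch of the script (also passing newline='' when opening the files, as the Python 3 csv docs recommend, and comparing the registrations column as an integer; the header-skip line assumes the source file has a header row):
import csv
import time

out_name = 'RegisteredEvents' + time.strftime("%m_%d_%Y-%I_%M_%S") + '.csv'

with open('allEvents.csv', newline='') as in_f, open(out_name, 'w', newline='') as out_f:
    reader = csv.reader(in_f)
    writer = csv.writer(out_f)
    writer.writerow(["Event Name", "Start Date", "End Date", "Registrations",
                     "Total Revenue", "ID", "Status"])
    next(reader)  # skip the input header row (assumed to be present)
    for row in reader:
        if int(row[3]) > 0:  # column 4 holds the registration count
            writer.writerow(row)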

Formatting json file in Python

import json

counter = {"a": 1, "b": 2}
with open('egg.json', 'w') as json_file:
    json.dump(counter, json_file)
So when I review my json file, it shows this:
{a:1 , b:2}
But I need it to be something like this:
[ [a:1], [b:2] ]
I've already tried adding
json.dump(counter, json_file, separator (' [ ', ' ] ')
But nothing will do the trick...
Is there a way to format the json file the way you can format a CSV file?
I'd really like to know..... Thanks.
[a:1], [b:2] isn't valid json, so using the json module won't help you here.
If for some reason you want a formatted string output, you could instead do the following (don't call the file egg.json since it won't be valid json!):
counter = {'a': 1, 'b': 2}
output = []
for k, v in sorted(counter.items()):
    output.append('[{}:{}]'.format(k, v))
with open('egg.txt', 'w') as txt_f:
    txt_f.write(', '.join(output))
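If what is actually needed is valid JSON in a list-like shape, the closest legal structure would be a list of [key, value] pairs; a sketch of that alternative:
import json

counter = {'a': 1, 'b': 2}

with open('egg.json', 'w') as json_file:
    # Produces [["a", 1], ["b", 2]], which is valid JSON, unlike [a:1], [b:2]
    json.dump(sorted(counter.items()), json_file)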

Julia - Rewriting a CSV

Complete Julia newbie here.
I'd like to do some processing on a CSV. Something along the lines of:
using CSV
in_file = CSV.Source('/dir/in.csv')
out_file = CSV.Sink('/dir/out.csv')
for line in CSV.eachline(in_file)
    replace!(line, "None", "")
    CSV.writeline(out_file, line)
end
This is in pseudocode, those aren't existing functions.
Idiomatically, should I iterate on 1:CSV.countlines(in_file)? Do a while and check something?
If all you want to do is replace a string in the line, you do not need any CSV parsing utilities. All you do is read the file line by line, replace, and write. So:
infile = "/path/to/input.csv"
outfile = "/path/to/output.csv"
out = open(outfile, "w+")
for line in readlines(infile)
    newline = replace(line, "a", "b")
    write(out, newline)
end
close(out)
This will replicate the pseudocode you have in your question.
If you need to parse and read the csv field by field, use the readcsv function in base.
data=readcsv(infile)
typeof(data) #Array{Any,2}
This will return the data in the file as a 2 dimensional array. You can process this data any way you want, and write it back using the writecsv function.
for i in 1:size(data, 1)  # iterate by rows
    data[i, 1] = "This is " * data[i, 1]  # Add text to first column
end
writecsv(outfile, data)
Documentation for these functions:
http://docs.julialang.org/en/release-0.5/stdlib/io-network/?highlight=readcsv#Base.readcsv
http://docs.julialang.org/en/release-0.5/stdlib/io-network/?highlight=readcsv#Base.writecsv