Julia - Rewriting a CSV

Complete Julia newbie here.
I'd like to do some processing on a CSV. Something along the lines of:
using CSV
in_file = CSV.Source("/dir/in.csv")
out_file = CSV.Sink("/dir/out.csv")
for line in CSV.eachline(in_file)
    replace!(line, "None", "")
    CSV.writeline(out_file, line)
end
This is pseudocode; those aren't existing functions.
Idiomatically, should I iterate over 1:CSV.countlines(in_file)? Do a while loop and check something?

If all you want to do is replace a string in each line, you do not need any CSV parsing utilities. All you do is read the file line by line, replace, and write:
infile = "/path/to/input.csv"
outfile = "/path/to/output.csv"
out = open(outfile, "w")
for line in readlines(infile)           # in Julia 0.5, readlines keeps the trailing newlines
    newline = replace(line, "a", "b")   # 0.5 syntax; newer Julia uses replace(line, "a" => "b")
    write(out, newline)
end
close(out)
This will replicate the pseudocode you have in your question.
If you need to parse and read the CSV field by field, use the readcsv function in Base:
data = readcsv(infile)
typeof(data)  # Array{Any,2}
This will return the data in the file as a two-dimensional array. You can process this data any way you want and write it back using the writecsv function.
for i in 1:size(data, 1)  # iterate over rows
    data[i, 1] = "This is " * data[i, 1]  # prepend text to the first column
end
writecsv(outfile, data)
Documentation for these functions:
http://docs.julialang.org/en/release-0.5/stdlib/io-network/?highlight=readcsv#Base.readcsv
http://docs.julialang.org/en/release-0.5/stdlib/io-network/?highlight=readcsv#Base.writecsv

Is there a way to programmatically set a dataset's schema from a .csv

As an example, I have a .csv in the Excel dialect, which escapes quotes by doubling them (like Python's csv module with doublequote=True).
For example, consider the row below:
"XX ""YYYYYYYY"", ZZZZZZ ""QQQQQQ""","JJJJ ""MMMM"", RRRR ""TTTT""",1234,RRRR,60,50
I would want the schema to then become:
[
'XX "YYYYYYYY", ZZZZZZ "QQQQQQ"',
'JJJJ "MMMM", RRRR "TTTT"',
1234,
'RRRR',
60,
50
]
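For what it's worth, Python's csv module (with its default doublequote=True) already parses the row the way I want, which is what I mean by the Excel dialect:

import csv

row = '"XX ""YYYYYYYY"", ZZZZZZ ""QQQQQQ""","JJJJ ""MMMM"", RRRR ""TTTT""",1234,RRRR,60,50'
print(next(csv.reader([row])))
# ['XX "YYYYYYYY", ZZZZZZ "QQQQQQ"', 'JJJJ "MMMM", RRRR "TTTT"', '1234', 'RRRR', '60', '50']
# (everything comes back as a string; inferring 1234, 60, 50 as numbers is a separate step)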
Is there a way to set the schema of a dataset in a programmatic/automated fashion?
While you can do this in code, Foundry's dataset app can also do it natively. This means you can skip writing the code (which is nice), and it can also save a step in your pipeline (which might save you runtime).
After uploading the files to a dataset, press "edit schema" on the dataset. Then apply parsing settings that match your dialect (a header row, " as the quote character, and " as the escape character), which would result in the desired outcome in your case. Then press "save and validate" and the dataset should end up with the correct schema.
Starting with this example:
Dataset<Row> dataset = files
    .sparkSession()
    .read()
    .option("inferSchema", "true")
    .csv(csvDataset);
output.getDataFrameWriter(dataset).write();
Add the header, quote, and escape options, like so:
Dataset<Row> dataset = files
    .sparkSession()
    .read()
    .option("inferSchema", "true")
    .option("header", "true")
    .option("quote", "\"")
    .option("escape", "\"")
    .csv(csvDataset);
output.getDataFrameWriter(dataset).write();
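For reference, the same reader options work unchanged from PySpark; here is a minimal sketch, assuming pyspark is installed and a local in.csv (the path is hypothetical):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = (spark.read
      .option("inferSchema", "true")
      .option("header", "true")
      .option("quote", '"')    # fields are wrapped in double quotes
      .option("escape", '"')   # a doubled "" inside a field escapes the quote
      .csv("in.csv"))          # hypothetical input path
df.printSchema()

The escape option is the key piece: setting it to the quote character itself tells Spark that "" inside a quoted field is a literal quote, which is exactly the Excel-style doubling in your sample row.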

Python 3 psycopg2 COPY from stdin failed: error in .read()

I am trying to apply the code found on this page, in particular the 'Copy Data from String Iterator' section.
Since not all lines coming from the generator (here log_lines) can be imported into the PostgreSQL database, I try to filter for the correct lines (here row) using itertools.filterfalse, as in the code block below:
from itertools import filterfalse

def copy_string_iterator(connection, log_lines) -> None:
    with connection.cursor() as cursor:
        create_staging_table(cursor)
        log_string_iterator = StringIteratorIO((
            '|'.join(map(clean_csv_value, (
                row['date'],
                row['time'],
                row['cs_uri_query'],
                row['s_contentpath'],
                row['sc_status'],
                row['s_computername'],
                ...
                row['sc_substates'],
                row['s_port'],
                row['cs_version'],
                row['c_protocol'],
                row.update({'cs_cookie':'x'}),
                row['timetakenms'],
                row['cs_uri_stem'],
            ))) + '\n'
            for row in filterfalse(lambda line: "#" in line.get('date'), log_lines)
        ))
        cursor.copy_from(log_string_iterator, 'log_table', sep='|')
When I run this, cursor.copy_from() gives me the following error:
QueryCanceled: COPY from stdin failed: error in .read() call
CONTEXT: COPY log_table, line 112910
I understand why this error happens: in the test file I use, there are only 112909 lines that meet the filterfalse condition. But why does it try to copy line 112910 and throw the error instead of just stopping?
COPY keeps calling .read() on your iterator until it reaches end of input, so when building the next line raises an exception (here, most likely a KeyError for a field missing from a row), psycopg2 reports it as an error in the .read() call at the line it was producing rather than stopping cleanly. Since Python doesn't have a null-coalescing operator, add something like:
(map(clean_csv_value, (
    row['date'] if 'date' in row else None,
    :
    row['cs_uri_stem'] if 'cs_uri_stem' in row else None,
))) + '\n'
for each of your fields, so you can handle any missing fields in the JSON file. Of course, the fields should be nullable in the db if you use None; otherwise, replace None with some default value for that field.
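A more compact way to get the same null-coalescing behaviour is dict.get, which returns None when a key is missing. A minimal runnable sketch (this clean_csv_value is a stand-in modelled on the linked article's helper, not code from your project):

def clean_csv_value(value):
    # Stand-in helper: encode None as PostgreSQL's \N and escape embedded
    # newlines so COPY sees exactly one record per line.
    if value is None:
        return r'\N'
    return str(value).replace('\n', '\\n')

rows = [
    {'date': '2020-01-01', 'time': '12:00:00'},
    {'time': '12:00:01'},  # 'date' key missing entirely
]

for row in rows:
    line = '|'.join(map(clean_csv_value, (
        row.get('date'),  # None instead of KeyError when the key is absent
        row.get('time'),
    ))) + '\n'
    print(repr(line))  # the second row prints the line '\\N|12:00:01' plus a newline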

blank file while copying a file in python

I have a function that takes a file as input, prints certain statistics, and also copies the file into a file name provided by the user. Here is my current code:
def copy_file(option):
    infile_name = input("Please enter the name of the file to copy: ")
    infile = open(infile_name, 'r')
    outfile_name = input("Please enter the name of the new copy: ")
    outfile = open(outfile_name, 'w')
    slist = infile.readlines()
    if option == 'statistics':
        for line in infile:
            outfile.write(line)
        infile.close()
        outfile.close()
        result = []
        blank_count = slist.count('\n')
        for item in slist:
            result.append(len(item))
        print('\n{0:<5d} lines in the list\n{1:>5d} empty lines\n{2:>7.1f} average character per line\n{3:>7.1f} average character per non-empty line'.format(
            len(slist), blank_count, sum(result)/len(slist), (sum(result)-blank_count)/(len(slist)-blank_count)))

copy_file('statistics')
It prints the statistics of the file correctly; however, the copy it makes of the file is empty. If I remove the readlines() part and the statistics part, the function seems to copy the file correctly. How can I correct my code so that it does both? It's a minor problem, but I can't seem to get it.
The reason the file is blank is that
slist = infile.readlines()
is reading the entire contents of the file, so when it gets to
for line in infile:
there is nothing left to read, and it just closes the newly truncated (mode w) file, leaving you with a blank file.
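You can see the effect in isolation with a few lines (a sketch; demo.txt stands for any small text file):

with open('demo.txt') as f:
    slist = f.readlines()        # consumes the whole file
    rest = [line for line in f]  # the file iterator is already at EOF

print(len(slist), len(rest))     # prints e.g. "3 0": nothing left to iterate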
I think the answer here is to change your for line in infile: to for line in slist:
def copy_file(option):
    infile_name = input("Please enter the name of the file to copy: ")
    infile = open(infile_name, 'r')
    outfile_name = input("Please enter the name of the new copy: ")
    outfile = open(outfile_name, 'w')
    slist = infile.readlines()
    if option == 'statistics':
        for line in slist:
            outfile.write(line)
        infile.close()
        outfile.close()
        result = []
        blank_count = slist.count('\n')
        for item in slist:
            result.append(len(item))
        print('\n{0:<5d} lines in the list\n{1:>5d} empty lines\n{2:>7.1f} average character per line\n{3:>7.1f} average character per non-empty line'.format(
            len(slist), blank_count, sum(result)/len(slist), (sum(result)-blank_count)/(len(slist)-blank_count)))

copy_file('statistics')
Having said all that, consider whether it's worth using your own copy routine rather than shutil.copy. It's always better to delegate the task to your OS, as it will be quicker and probably safer (thanks to NightShadeQueen for the reminder)!
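For instance, a minimal sketch of that approach (copy_file_with_stats is a made-up name for illustration):

import shutil

def copy_file_with_stats(infile_name, outfile_name):
    shutil.copy(infile_name, outfile_name)  # delegate the copy to the OS
    with open(infile_name) as f:
        slist = f.readlines()
    blank_count = slist.count('\n')
    lengths = [len(line) for line in slist]
    print('{0} lines, {1} empty lines, {2:.1f} average characters per line'.format(
        len(slist), blank_count, sum(lengths) / len(slist)))

copy_file_with_stats('in.txt', 'out.txt')  # hypothetical file names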

Read An Input.md file and output a .html file Haskell

I had a question concerning some basic transformations in Haskell.
Basically, I have an input file named Input.md. It contains some markdown text that is read in my project file, and I want to write a few functions to do transformations on the text. After running these transformations under a function called convertToHTML, I want to output the result as an .html file in the correct format.
module Main
    (
      convertToHTML,
      main
    ) where

import System.Environment (getArgs)
import System.IO
import Data.Char (toLower, toUpper)

process :: String -> String
process s = head $ lines s

convertToHTML :: String -> String
convertToHTML str = do
    x <- str
    if (x == '#')
        then "<h1>"
        else return x
--convertToHTML x = map toUpper x

main = do
    args <- getArgs -- command line args
    let (infile, outfile) = (\(x:y:ys) -> (x, y)) args
    putStrLn $ "Input file: " ++ infile
    putStrLn $ "Output file: " ++ outfile
    contents <- readFile infile
    writeFile outfile $ convertToHTML contents
So,
How would I read through my input file and transform any line that starts with a # into an html tag?
How would I read through my input file once more and transform any WORD that is surrounded by underscores (_word_) into another html tag?
How would I replace any character with an html string?
I tried using functions such as map, filter, and zipWith, but could not figure out how to iterate through the text and transform each piece of it. Please, if anybody has any suggestions. I've been working on this for 2 days straight and have a bunch of failed code to show for it.
I tried using functions such as map, filter, and zipWith, but could not figure out how to iterate through the text and transform each piece of it.
Because they work on an appropriate collection of elements, and they don't really "iterate"; you simply have to feed them the appropriate data. Let's tackle the # problem as an example.
Our file is one giant String, and what we'd like is to have it nicely split into lines, so [String]. What could do that for us? I have no idea, so let's just search Hoogle for String -> [String].
Ah, there we go, the lines function! Its counterpart, unlines, is also going to be useful. Now we can write our line wrapper:
convertHeader :: String -> String
convertHeader [] = []  -- prevents calling head on an empty line
convertHeader x  = if head x == '#' then "<h1>" ++ x ++ "</h1>"
                                    else x
and so:
convertHeaders :: String -> String
convertHeaders = unlines . map convertHeader . lines
-- reading right to left: lines :: String -> [String],
-- map convertHeader :: [String] -> [String], unlines :: [String] -> String
As you can see, the function first splits the file into lines, maps convertHeader over each line, and then puts the file back together.
Try now doing the same with words to replace your formatting patterns. As a bonus exercise, change convertHeader to count the number of #s in front of the line and output <h1>, <h2>, <h3>, and so on accordingly.

Encapsulate complex mysql inserts with back quotes

I have exported some sql statements and now need to replay them. The issue arises when I play this back into mysql where actual java code is included in the VALUES. This code includes backslashes, single quotes, and double quotes, sometimes causing mysql to misinterpret the escaping and quoting:
INSERT INTO sonar.snapshot_sources(ID, SNAPSHOT_ID, DATA) VALUES (267, 420,
'package com.company.gateway.dl.util
"((\''|\")*)(stuff|moreStuff)((\''|\")*):((\''|\")*)([0-9]+)((\''|\")*)";
');
The above is not the full text, but the quoted bit here causes the insert to fail. A workaround is to enclose the complex text value in back quotes, e.g.
INSERT INTO sonar.snapshot_sources(ID, SNAPSHOT_ID, DATA) VALUES (267, 420,
`'package com.company.gateway.dl.util
"((\''|\")*)(stuff|moreStuff)((\''|\")*):((\''|\")*)([0-9]+)((\''|\")*)";
'`);
The question I have is: how can I sed precisely these statements to add back quotes to encapsulate where required? I want to replace 'package with `'package and '); with '`);. The closing bracket seems the most complicated, since many more statements will match it.
The Python script below can help you. It reads the sql dump file, adds backticks around the matched regex, and prints the result to an output file. The input file and output file should be given as command-line arguments.
#!/usr/bin/env python3
import re
import os
import sys

if len(sys.argv) < 3:
    print('Provide input and output file to begin processing')
    sys.exit(1)
else:
    if not os.path.isfile(sys.argv[1]):
        print('Input file does not exist')
        sys.exit(1)

ifile = sys.argv[1]
ofile = sys.argv[2]

# Compile the regex: capture the INSERT prefix, the quoted 'package ...' body,
# and the closing );
rex = re.compile(r"""(INSERT[\s\S]+?)('package[\s\S]+?')(\);)""")

# Read the whole dump file as a single string
dataFile = open(ifile, 'r').read()
fp = open(ofile, 'w')

# Wrap the second capture group in backticks and write the result out
print(re.sub(rex, r'\1`\2`\3', dataFile), end='', file=fp)
fp.close()
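Assuming the script is saved as add_backticks.py (the file name is arbitrary), run it against your dump like so: python3 add_backticks.py dump.sql dump_fixed.sql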
Output
INSERT INTO sonar.snapshot_sources(ID, SNAPSHOT_ID, DATA) VALUES (267, 420,
`'package com.company.gateway.dl.util
"((\''|\")*)(stuff|moreStuff)((\''|\")*):((\''|\")*)([0-9]+)((\''|\")*)";
'`);