CSV Parsing Issue with Attoparsec - csv

Here is my code that does CSV parsing, using the text and attoparsec
libraries:
import qualified Data.Attoparsec.Text as A
import qualified Data.Text as T
-- | Parse a field of a record.
field :: A.Parser T.Text -- ^ parser
field = fmap T.concat quoted <|> normal A.<?> "field"
where
normal = A.takeWhile (A.notInClass "\n\r,\"") A.<?> "normal field"
quoted = A.char '"' *> many between <* A.char '"' A.<?> "quoted field"
between = A.takeWhile1 (/= '"') <|> (A.string "\"\"" *> pure "\"")
-- | Parse a block of text into a CSV table.
comma :: T.Text -- ^ CSV text
-> Either String [[T.Text]] -- ^ error | table
comma text
| T.null text = Right []
| otherwise = A.parseOnly table text
where
table = A.sepBy1 record A.endOfLine A.<?> "table"
record = A.sepBy1 field (A.char ',') A.<?> "record"
This works well for a variety of inputs but is not working in case that there
is a trailing \n at the end of the input.
Current behaviour:
> comma "hello\nworld"
Right [["hello"],["world"]]
> comma "hello\nworld\n"
Right [["hello"],["world"],[""]]
Wanted behaviour:
> comma "hello\nworld"
Right [["hello"],["world"]]
> comma "hello\nworld\n"
Right [["hello"],["world"]]
I have been trying to fix this issue but I ran out of idaes. I am almost
certain that it will have to be something with A.endOfInput as that is the
significant anchor and the only "bonus" information we have. Any ideas on how
to work that into the code?
One possible idea is to look at the end of the string before running the
Attoparsec parser and removing the last character (or two in case of \r\n)
but that seems to be a hacky solution that I would like avoid in my code.
Full code of the library can be found here: https://github.com/lovasko/comma

Related

postgresql json to columns error Character with value must be escaped

I try to load some data from a table containing json rows.
There is one field that can contain special chars as \t and \r, and I want to keep them as is in the new table.
Here is my file:
{"text_sample": "this is a\tsimple test", "number_sample": 4}
Here is what I do:
Drop table if exists temp_json;
Drop table if exists test;
create temporary table temp_json (values text);
copy temp_json from '/path/to/file';
create table test as (select
(values->>'text_sample') as text_sample,
(values->>'number_sample') as number_sample
from (
select replace(values,'\','\\')::json as values
from temp_json
) a);
I keep getting this error:
ERROR: invalid input syntax for type json
DETAIL: Character with value 0x09 must be escaped.
CONTEXT: JSON data, line 1: ...g] Objection to PDDRP Mediation (was Re: Call for...
How do I need to escape those characters?
Thanks a lot
Copy the file as csv with a different quoting character and delimiter:
drop table if exists test;
create table test (values jsonb);
\copy test from '/path/to/file.csv' with (format csv, quote '|', delimiter ';');
select values ->> 'text_sample', values ->> 'number_sample'
from test;
?column? | ?column?
-----------------------------+----------
this is a simple test | 4
As mentioned in Andrew Dunstan's PostgreSQL and Technical blog
In text mode, COPY will be simply defeated by the presence of a backslash in the JSON. So, for example, any field that contains an embedded double quote mark, or an embedded newline, or anything else that needs escaping according to the JSON spec, will cause failure. And in text mode you have very little control over how it works - you can't, for example, specify a different ESCAPE character. So text mode simply won't work.
so we have to turn around to the CSV format mode.
copy the_table(jsonfield)
from '/path/to/jsondata'
csv quote e'\x01' delimiter e'\x02';
In the official document sql-copy, some Parameters list here:
COPY table_name [ ( column_name [, ...] ) ]
FROM { 'filename' | PROGRAM 'command' | STDIN }
[ [ WITH ] ( option [, ...] ) ]
[ WHERE condition ]
where option can be one of:
FORMAT format_name
FREEZE [ boolean ]
DELIMITER 'delimiter_character'
NULL 'null_string'
HEADER [ boolean ]
QUOTE 'quote_character'
ESCAPE 'escape_character'
FORCE_QUOTE { ( column_name [, ...] ) | * }
FORCE_NOT_NULL ( column_name [, ...] )
FORCE_NULL ( column_name [, ...] )
ENCODING 'encoding_name'
FORMAT
Selects the data format to be read or written: text, csv (Comma Separated Values), or binary. The default is text.
QUOTE
Specifies the quoting character to be used when a data value is quoted. The default is double-quote. This must be a single one-byte character. This option is allowed only when using CSV format.
DELIMITER
Specifies the character that separates columns within each row (line) of the file. The default is a tab character in text format, a comma in CSV format. This must be a single one-byte character. This option is not allowed when using binary format.
NULL
Specifies the string that represents a null value. The default is \N (backslash-N) in text format, and an unquoted empty string in CSV format. You might prefer an empty string even in text format for cases where you don't want to distinguish nulls from empty strings. This option is not allowed when using binary format.
HEADER
Specifies that the file contains a header line with the names of each column in the file. On output, the first line contains the column names from the table, and on input, the first line is ignored. This option is allowed only when using CSV format.
cast json as text, instead of getting text value from json. Eg:
t=# with j as (
select '{"text_sample": "this is a\tsimple test", "number_sample": 4}'::json v
)
select v->>'text_sample' your, (v->'text_sample')::text better
from j;
your | better
-----------------------------+--------------------------
this is a simple test | "this is a\tsimple test"
(1 row)
and to avoid 0x09 error, try using
replace(values,chr(9),'\t')
as in your example you replace backslash+t, not the actual chr(9)...

Julia - Rewriting a CSV

Complete Julia newbie here.
I'd like to do some processing on a CSV. Something along the lines of:
using CSV
in_file = CSV.Source('/dir/in.csv')
out_file = CSV.Sink('/dir/out.csv')
for line in CSV.eachline(in_file)
replace!(line, "None", "")
CSV.writeline(out_file, line)
end
This is in pseudocode, those aren't existing functions.
Idiomatically, should I iterate on 1:CSV.countlines(in_file)? Do a while and check something?
If all you want to do is replace a string in the line, you do not need any CSV parsing utilities. All you do is read the file line by line, replace, and write. So:
infile = "/path/to/input.csv"
outfile = "/path/to/output.csv"
out = open(outfile, "w+")
for line in readlines(infile)
newline = replace(line, "a", "b")
write(out, newline)
end
close(out)
This will replicate the pseudocode you have in your question.
If you need to parse and read the csv field by field, use the readcsv function in base.
data=readcsv(infile)
typeof(data) #Array{Any,2}
This will return the data in the file as a 2 dimensional array. You can process this data any way you want, and write it back using the writecsv function.
for i in 1:size(data,1) #iterate by rows
data[i, 1] = "This is " * data[i, 1] # Add text to first column
end
writecsv(outfile, data)
Documentation for these functions:
http://docs.julialang.org/en/release-0.5/stdlib/io-network/?highlight=readcsv#Base.readcsv
http://docs.julialang.org/en/release-0.5/stdlib/io-network/?highlight=readcsv#Base.writecsv

HIVE escaped by not working '\\'

I have a data-set in S3
123, "some random, text", "", "", 236
I build a external table on this dataset :
CREATE EXTERNAL TABLE db1.myData(
field1 bigint,
field2 string,
field3 string,
field4 string,
field5 bigint,
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
ESCAPED BY '\\'
LOCATION 's3n://thisMyData/';
Problem/ Issue :
when I do
select * from db1.myData
field2 is shown as
some random
I need the field to be
some random, text
Gotcha's :
1. I cannot change the delimiter as there are over ~300 .csv files at this location
2. ESCAPED BY is not escaping the '\\'
3. I'm using HIVE 0.13 so there I cannot use CSV SerDe and neither i'm allowed to import new jars to cluster (its a complicated process to add a new jar as I have to go through Director level approvals)
Question:
Is there a workaround for making 'ESCAPED BY' come alive ?!
Any other workarounds for this ??
All suggestions are welcome !!
N.B : THis is not a repeat question. If you think its a repeat, please guide me to right page and I will take this off of this portal :)
I had to use: ESCAPED BY '\134' which translates to: ESCAPED BY '\'.
Additionally, because I was calling the Athena create table statement by passing in the statement from a JSON file I had to add an extra \ to mask the original \ in JSON. So my final statement within the JSON file looked like this: ESCAPED BY '\\134'.
If you are using Hive 0.14, you can use CSV Serde like this:
CREATE TABLE my_table(a string, b string, ...)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = "\t",
"quoteChar" = "'",
"escapeChar" = "\\"
)
STORED AS TEXTFILE;
Refer below link for details:
https://cwiki.apache.org/confluence/display/Hive/CSV+Serde

Json Files parsing

So I am trying to open some json files to look for a publication year and sort them accordingly. But before doing this, I decided to experiment on a single file. I am having trouble though, because although I can get the files and the strings, when I try to print one word, it starts printinf the characters.
For example:
print data2[1] #prints
THE BRIDES ORNAMENTS, Viz. Fiue MEDITATIONS, Morall and Diuine. #results
but now
print data2[1][0] #should print THE
T #prints T
This is my code right now:
json_data =open(path)
data = json.load(json_data)
i=0
data2 = []
for x in range(0,len(data)):
data2.append(data[x]['section'])
if len(data[x]['content']) > 0:
for i in range(0,len(data[x]['content'])):
data2.append(data[x]['content'][i])
I probably need to look at your json file to be absolutely sure, but it seems to me that the data2 list is a list of strings. Thus, data2[1] is a string. When you do data2[1][0], the expected result is what you are getting - the character at the 0th index in the string.
>>> data2[1]
'THE BRIDES ORNAMENTS, Viz. Fiue MEDITATIONS, Morall and Diuine.'
>>> data2[1][0]
'T'
To get the first word, naively, you can split the string by spaces
>>> data2[1].split()
['THE', 'BRIDES', 'ORNAMENTS,', 'Viz.', 'Fiue', 'MEDITATIONS,', 'Morall', 'and', 'Diuine.']
>>> data2[1].split()[0]
'THE'
However, this will cause issues with punctuation, so you probably need to tokenize the text. This link should help - http://www.nltk.org/_modules/nltk/tokenize.html

Read An Input.md file and output a .html file Haskell

I had a question concerning some basic transformations in Haskell.
Basically, I have a written Input file, named Input.md. This contains some markdown text that is read in my project file, and I want to write a few functions to do transformations on the text. After completing these functions under a function called convertToHTML, I have output the file as an .html file in the correct format.
module Main
(
convertToHTML,
main
) where
import System.Environment (getArgs)
import System.IO
import Data.Char (toLower, toUpper)
process :: String -> String
process s = head $ lines s
convertToHTML :: String -> String
convertToHTML str = do
x <- str
if (x == '#')
then "<h1>"
else return x
--convertToHTML x = map toUpper x
main = do
args <- getArgs -- command line args
let (infile,outfile) = (\(x:y:ys)->(x,y)) args
putStrLn $ "Input file: " ++ infile
putStrLn $ "Output file: " ++ outfile
contents <- readFile infile
writeFile outfile $ convertToHTML contents
So,
How would I read through my input file, and transform any line that starts with a # to an html tag
How would I read through my input file once more and transform any WORD that is surrounded by _word_ (1 underscore) to another html tag
Replace any Character with an html string.
I tried using such functions such as Map, Filter, ZipWith, but could not figure out how to iterate through the text and transform each text. Please if anybody has any suggestions. I've been working on this for 2 days straight and have a bunch of failed code to show for a couple of weeks and have a bunch of failed code to show it.
I tried using such functions such as Map, Filter, ZipWith, but could not figure out how to iterate through the text and transform each text.
Because they work on appropriate element collection. And they don't really "iterate"; you simply have to feed the appropriate data. Let's tackle the # problem as an example.
Our file is one giant String, and what we'd like is to have it nicely split in lines, so [String]. What could do it for us? I have no idea, so let's just search Hoogle for String -> [String].
Ah, there we go, lines function! Its counterpart, unlines, is also going to be useful. Now we can write our line wrapper:
convertHeader :: String -> String
convertHeader [] = [] -- that prevents us from calling head on an empty line
convertHeader x = if head x == '#' then "<h1>" ++ x ++ "</h1>"
else x
and so:
convertHeaders :: String -> String
convertHeaders = unlines . map convertHeader . lines
-- ^String ^[String] ^[String] ^String
As you can see the function first converts the file to lines, maps convertHeader on each line, and the puts the file back together.
See it live on Ideone
Try now doing the same with words to replace your formatting patterns. As a bonus exercise, change convertHeader to count the number of # in front of the line and output <h1>, <h2>, <h3> and so on accordingly.