Reading a string with commas gets automatically wrapped with quotes - csv

I am parsing a tab-delimited file and some of the columns contain commas.
Each column with a comma is getting wrapped in quotes, which breaks some code later on that tries to search for those values (without the quotes). For example, if the original value in the csv file was "A, B, C", it is now stored as ""A, B, C"".
How can those extra quotes be removed or escaped automatically?
Thanks
The code I am using to read the file is:
genresMap = [:]
new File(runtime_dir + 'file.csv').eachLine { def line ->
    column = line.split("\t");
    genresMap[column[0].toString()] = column[1];
}
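The usual fix is to let a CSV parser handle the quoting instead of splitting on the delimiter yourself. A minimal Python sketch of that idea (not Groovy; the file name and two-column layout are taken from the snippet above, and it assumes the file uses standard CSV-style quoting with a tab separator):
import csv

# Sketch only: let the parser strip the surrounding quotes instead of
# splitting on tabs by hand. 'file.csv' is the name from the snippet above.
genres_map = {}
with open('file.csv', newline='') as f:
    for row in csv.reader(f, delimiter='\t'):
        # csv.reader removes the wrapping quotes, so a value like "A, B, C"
        # comes back as A, B, C without the extra double quotes.
        genres_map[row[0]] = row[1]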

Related

rust-csv parse a string field wrapped in double quotes that contains newlines and uses double-quotes as escape character

I have a csv similar to this (original file is proprietary, cannot share). Separator is Tab.
It contains a description column whose text is wrapped in double quotes and can itself contain quoted strings, where, wait for it, the escape character is also a double quote.
id description other_field
12 "Some Description" 34
56 "Some
Multiline
""With Escaped Stuff""
Description" 78
I am parsing the file with this code
let mut reader = csv::ReaderBuilder::new()
.from_reader(file)
.deserialize().unwrap();
I'm consistently getting a CSV deserialize error:
CSV deserialize error: record 43747 (line: 43748, byte: 21082563): missing field 'id'
I tried using flexible(true) and double_quote(true) with no luck.
Is it possible to parse this type of field, and if so, how?
Actually the issue was unrelated; rust-csv parses this perfectly. I just forgot to set the delimiter (tab in this case). This code works:
let mut reader = csv::ReaderBuilder::new()
.delimiter(b'\t')
.from_reader(file)
.deserialize().unwrap();
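For what it's worth, the format itself (tab separator, quoted multi-line fields, doubled quotes as the escape) is ordinary CSV-style quoting, so any standard parser configured for a tab delimiter handles it. A quick Python sketch with invented sample text, just to illustrate the format:
import csv
import io

# Invented sample mimicking the format described above: tab-separated,
# a quoted multi-line description, and "" as the escape for a literal quote.
data = '12\t"Some Description"\t34\n56\t"Some\nMultiline\n""With Escaped Stuff""\nDescription"\t78\n'
for row in csv.reader(io.StringIO(data), delimiter='\t'):
    print(row)
# ['12', 'Some Description', '34']
# ['56', 'Some\nMultiline\n"With Escaped Stuff"\nDescription', '78']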

How do I search for a string in this JSON with Python

My JSON file looks something like:
{
    "generator": {
        "name": "Xfer Records Serum",
        ....
    },
    "generator": {
        "name": "Lennar Digital Sylenth1",
        ....
    }
}
I ask the user for a search term and the input is searched for in the name key only. All matching results are returned, so if I input just 's', both of the entries above would be returned. Please also explain how to return all the object names which are generators. The simpler the method, the better it will be for me. I use the json library; however, if another library is required, that is not a problem.
Before switching to JSON I tried XML but it did not work.
If your goal is just to search all name properties, this will do the trick:
import re

def search_names(term, lines):
    name_search = re.compile(r'\s*"name"\s*:\s*"(.*' + term + r'.*)",?$', re.I)
    return [x.group(1) for x in [name_search.search(y) for y in lines] if x]

with open('path/to/your.json') as f:
    lines = f.readlines()

print(search_names('s', lines))
which would return both names you listed in your example.
The way the search_names() function works is that it builds a regular expression matching any line that starts with "name": " (with a varying amount of whitespace), followed by your search term with any other characters around it, then terminated with " followed by an optional , and the end of the string. It then applies that expression to each line from the file. Finally, it filters out any non-matching lines and returns the value of the name property (the capture group contents) for each match.
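If the real file is valid JSON (the duplicate "generator" keys in the sample would collapse under json.load, so the sketch below assumes the generators actually sit in a top-level list; that list name is invented), parsing it once and searching the parsed objects is sturdier than line-oriented regexes, and it also covers the "all generator names" part of the question:
import json

with open('path/to/your.json') as f:
    data = json.load(f)

# Assumed structure: a top-level "generators" list of objects, each with a "name".
generators = data.get('generators', [])
all_names = [g.get('name', '') for g in generators]

def search_names(term):
    # case-insensitive substring match on the name key only
    return [n for n in all_names if term.lower() in n.lower()]

print(all_names)        # every generator name
print(search_names('s'))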

PySpark write CSV quote all non-numeric

Is there a way to quote only non-numeric columns in the dataframe when output to CSV file using df.write.csv('path')?
I know you can use the option quoteAll=True to quote all the columns but I only want to quote the string columns.
I am using PySpark 2.2.0.
I only want to quote the string columns.
There is currently no parameter in write.csv that you can use to specify which columns to quote. However, one workaround is to modify your string columns by adding quotes around the values.
First identify the string columns by iterating over the dtypes
string_cols = [c for c, t in df.dtypes if t == "string"]
Now you can modify these columns by adding a quote as a prefix and suffix:
from pyspark.sql.functions import col, lit, concat

cols = [
    # .alias(c) keeps the original column name after wrapping the value in quotes
    concat(lit('"'), col(c), lit('"')).alias(c) if c in string_cols else col(c)
    for c in df.columns
]
df = df.select(*cols)
Finally write out the csv:
df.write.csv('path')
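Putting the pieces together, here is a rough end-to-end sketch (the SparkSession setup and sample rows are invented for illustration). One thing to watch: by default the writer wraps and escapes any value that contains a quote character, so the manually added quotes may come out escaped unless that behaviour is turned off, e.g. with escapeQuotes=False.
# Rough sketch; the sample data and column names are made up.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit, concat

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, "Alice", 3.5), (2, "Bob", 4.0)],
    ["id", "name", "score"],
)

# Wrap only the string columns in literal quotes, keeping the original names.
string_cols = [c for c, t in df.dtypes if t == "string"]
quoted = df.select(*[
    concat(lit('"'), col(c), lit('"')).alias(c) if c in string_cols else col(c)
    for c in df.columns
])

# escapeQuotes=False should keep the writer from re-quoting and escaping the
# values we just wrapped; verify the output against whatever consumes the CSV.
# Note: values that contain the separator itself still need real CSV quoting.
quoted.write.csv('path', header=True, escapeQuotes=False)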

Removing \n \\n and other unwanted characters from a json unicode dictionary with python

I've tried a couple of different solutions to fix my problem with some "funny" newlines within my JSON dictionary and none of them works, so I thought I might make a post. The dictionary is obtained by scraping a website.
I have a json dictionary:
my_dict = {
u"Danish title": u"Avanceret",
u"Course type": u"MScTechnol",
u"Type of": u"assessmen",
u"Date": u"\nof exami",
u"Evaluation": u"7 step sca",
u"Learning objectives": u"\nA studen",
u"Participants restrictions": u"Minimum 10",
u"Aid": u"No Aid",
u"Duration of Course": u"13 weeks",
u"name": u"Advanced u",
u"Department": u"31\n",
u"Mandatory Prerequisites": u"31545",
u"General course objectives": u"\nThe cour",
u"Responsible": u"\nMartin C",
u"Location": u"Campus Lyn",
u"Scope and form": u"Lectures, ",
u"Point( ECTS )": u"10",
u"Language": u"English",
u"number": u"31548",
u"Content": u"\nThe cour",
u"Schedule": u"F4 (Tues 1"
}
I have stripped the value content to [:10] to reduce clutter, but some of the values have a length of 300 characters. It might not be portrayed well here, but some of the values have a lot of newline characters in them, and I've tried a lot of different solutions to remove them, such as str.strip and str.replace, but without success because my 'values' are unicode. And by values I mean key, value in my_dict.items().
How do I remove all the newlines appearing in my dictionary? (With the values in focus, as some of the newlines are trailing, some are leading and others are in the middle of the content, e.g. \nI have a\ngood\n idea\n.)
EDIT
I am using Python v. 2.7.11 and the following piece of code doesn't produce what I need. I want all the newlines to be changed to a single whitespace character.
for key, value in test.items():
    value = str(value[:10]).replace("\n", " ")
    print key, value
If you're trying to remove all \n or any junk characters apart from numbers and letters, then use a regex:
import re

for key in my_dict.keys():
    my_dict[key] = my_dict[key].replace('\\n', '')
    my_dict[key] = re.sub('[^A-Za-z0-9 ]+', '', my_dict[key])
print my_dict
If you wish to keep anything apart from those, then add it to the character class inside the regex.
To remove '\n', try this:
for key, value in my_dict.items():
    my_dict[key] = ''.join(value.split('\n'))
You need to put the updated value back into your dictionary (similar to a "by value vs. by reference" situation ;)).
To remove the "\n", this one-liner may be more "pythonic":
new_test = {k: v.replace("\n", "") for k, v in test.iteritems()}
To do what you are trying to do in your loop, try something like:
new_test = {k: str(v[:10]).replace("\n", " ") for k, v in test.iteritems()}
In your code, value takes the new value, but you never write it back...
So, for example, this would work (but be slower; you would also be changing the values while iterating over the dictionary, which should not cause problems here, but the interpreter might not like it):
for key, value in test.items():
    value = str(value[:10]).replace("\n", " ")
    # now put it back into the dictionary...
    test[key] = value
    print key, value
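If the requirement from the EDIT is specifically that every newline (leading, trailing, or in the middle, as in \nI have a\ngood\n idea\n) becomes a single space, a regex-based normalisation does it in one pass. A small sketch, assuming the my_dict shown above and the Python 2.7.11 mentioned in the question:
import re

# Collapse every run of whitespace (newlines included) to one space and
# trim the ends; works on the unicode values above under Python 2.7.
cleaned = {k: re.sub(r'\s+', u' ', v).strip() for k, v in my_dict.iteritems()}

print cleaned[u"Date"]        # "of exami" (leading newline gone)
print cleaned[u"Department"]  # "31" (trailing newline gone)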

hive: create table / data type syntax for comma separated files

The text file is comma-separated. However, one of the columns, e.g. "Issue" with the value "Other (phone, health club, etc.)", also contains commas.
Question: What should the data type of "Issue" be? And how should I format the table (row format delimited fields terminated by) so that the commas inside the column (Issue) are accounted for correctly?
I had set it this way:
create table consumercomplaints (ComplaintID int,
Product string,
Subproduct string,
Issue string,
Subissue string,
State string,
ZIPcode int,
Submittedvia string,
Datereceived string,
Datesenttocompany string,
Company string,
Companyresponse string,
Timelyresponse string,
Consumerdisputed string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
location '/user/hive/warehouse/mydb/consumer_complaints.csv';
Sample data --
Complaint ID,Product,Sub-product,Issue,Sub-issue,State,ZIP code,Submitted via,Date received,Date sent to company,Company,Company response,Timely response?,Consumer disputed?
943291,Debt collection,,Cont'd attempts collect debt not owed,Debt is not mine,MO,63123,Web,07/18/2014,07/18/2014,"Enhanced Recovery Company, LLC",Closed with non-monetary relief,Yes,
943698,Bank account or service,Checking account,Deposits and withdrawals,,CA,93030,Web,07/18/2014,07/18/2014,U.S. Bancorp,In progress,Yes,
943521,Debt collection,,Cont'd attempts collect debt not owed,Debt is not mine,OH,44116,Web,07/18/2014,07/18/2014,"Vital Solutions, Inc.",Closed with explanation,Yes,
943400,Debt collection,"Other (phone, health club, etc.)",Communication tactics,Frequent or repeated calls,MD,21133,Web,07/18/2014,07/18/2014,"The CBE Group, Inc.",Closed with explanation,Yes,
I think you need to delimit your source data with some control character such as Control-A. I don't think any data type will handle this for you. Or you can write a UDF to load the data and take care of the formatting in the UDF logic.
Short of writing a SerDe, you could do two things:
escape the commas in the original data before loading, using some character, e.g. \
and then use the Hive create table command with row format delimited fields terminated by ',' escaped by '\'
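To illustrate the escaping step, here is a small Python preprocessing sketch (the file names are placeholders; it assumes the raw file uses ordinary double-quote CSV quoting like the sample above and has no newlines inside fields):
import csv

# Read the quoted CSV and write it back out with embedded commas escaped
# as \, so that Hive's "escaped by '\\'" can load it. File names are placeholders.
with open('consumer_complaints.csv') as src, \
        open('consumer_complaints_escaped.csv', 'w') as dst:
    for row in csv.reader(src):
        dst.write(','.join(field.replace(',', '\\,') for field in row) + '\n')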
You can use a regex that takes care of the commas enclosed within double quotes.
First, apply a regex to the data as shown in the Hortonworks/Apache manuals:
regexp_extract(col_value, '^(?:([^,]*)\,?){1}', 1) player_id
(source: https://web.archive.org/web/20171125014202/https://hortonworks.com/tutorial/how-to-process-data-with-apache-hive/)
Ensure that you are able to load and see your data using this expression (barring the enclosed commas).
Then modify the expression to account for the enclosed commas. You can do something like this:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

String s = "a,\"hi, I am here\",c,d,\"ahoy, mateys\"";
String pattern = "^(?:([^\",]*|\"[^\"]*\"),?){4}";
Pattern p = Pattern.compile(pattern);
Matcher m = p.matcher(s);
if (m.find()) {
    System.out.println("YES-" + m.groupCount());
    System.out.println("=>" + m.group(1));
}
By changing {4} to {1}, {2}, ..., you can get the respective fields.