Removing special characters in Drupal Views CSV Data Export - csv

I have a CSV file, Data Export from Books and Chapters in Drupal 7.
After exporting to a CSV file, the body of the chapter shows an "Â ", or "Â " in the source code, for each " ", in the CSV file. The same thing happens for an apostrophe: "’" for each apostrophe (').
An example of the Body in my CSV file:
  This book will discuss Title I (General Requirements) and Title IV (Miscellaneous Provisions).  The FMLA was effective on August 5, 1993, six months after its passage.Â
The same text in Drupal:
This book will discuss Title I (General Requirements) and Title IV (Miscellaneous Provisions). The FMLA was effective on August 5, 1993, six months after its passage.
The same text in my Drupal wysiwyg source code:
This book will discuss Title I (General Requirements) and Title IV (Miscellaneous Provisions). The FMLA was effective on August 5, 1993, six months after its passage.
From what I can see, each non-breaking space (&nbsp;) turns into "Â " in the CSV file.
In the Data Export View, I have the Rewrite Results of Content: Body (Body) set to strip HTML tags. I have also tried both preserving the tag and excluding it.
This is probably a beginner error. Does anyone know how to remove or replace the results to show the proper text without the odd markup ("  ", )?
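One possible explanation, offered as a guess rather than a confirmed diagnosis: the export is written as UTF-8, and the program opening the CSV reads it as Windows-1252. Under that assumption, a non-breaking space and a curly apostrophe turn into exactly the sequences described above. The Python sketch below reproduces the effect; the helper names misdecode and repair are made up for illustration.

```python
NBSP = "\u00a0"        # non-breaking space, what &nbsp; stands for
APOSTROPHE = "\u2019"  # right single quotation mark (curly apostrophe)


def misdecode(text):
    """Encode as UTF-8, then decode the bytes as Windows-1252 (the assumed mix-up)."""
    return text.encode("utf-8").decode("cp1252")


def repair(text):
    """Reverse the mix-up. Raises an error if the text was not garbled this way."""
    return text.encode("cp1252").decode("utf-8")


garbled_space = misdecode(NBSP)
garbled_apostrophe = misdecode(APOSTROPHE)

print(repr(garbled_space))   # capital A with circumflex followed by a non-breaking space
print(garbled_apostrophe)    # three characters instead of one apostrophe

# Once the text is decoded correctly, the non-breaking space can be swapped
# for a plain space if that is what the CSV should contain.
clean = repair(garbled_space).replace(NBSP, " ")
```

If this is the cause, the data in the file is fine and only the reading side needs to be told the file is UTF-8; repair is meant for text that has already been saved in garbled form.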

Related

ERP ( IFS ) export into CSV - coding problem

I'm exporting some data from the ERP system (IFS) into a CSV file. From that CSV it's uploaded to another tool.
I have a problem with character encoding. Until now we were pulling only Danish and Finnish data and used WE8MSWIN1252. Now we also need to include Polish characters. Unfortunately, the encoding we use does not cover the special characters in Polish. I've already tried AL16UTF16, AL32UTF8 and EEC8EUROASCI, and none of them gave the expected result (all of the Danish, Finnish and Polish special characters displayed correctly in the CSV). Is there any encoding that would cover all of those characters in the CSV? When I opened the AL32UTF8 output in Notepad it looked fine, but we have to use CSV because of the integration that is the next step in the puzzle.
Please note that changing the csv to anything else is really the last resort. We don't want to play with the integration that is going further.
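Since the AL32UTF8 output looked fine in Notepad, the data itself may already be correct: UTF-8 covers Danish, Finnish and Polish letters, while Windows-1252 (Oracle's WE8MSWIN1252) has no Polish "Ł". Some tools only recognise a CSV as UTF-8 when it starts with a byte order mark. Whether the receiving tool here behaves that way is an assumption, not something the post confirms. A minimal Python sketch of writing such a file; the file name and sample rows are invented for illustration.

```python
import csv
import os
import tempfile

# Sample rows (made up) with Danish, Finnish and Polish letters.
rows = [
    ["country", "sample"],
    ["DK", "Ærø, Søndergård"],
    ["FI", "Jyväskylä, Åbo"],
    ["PL", "Łódź, Świętokrzyskie, żółć"],
]

# Windows-1252 cannot represent the Polish letters at all.
try:
    "Łódź".encode("cp1252")
    cp1252_ok = True
except UnicodeEncodeError:
    cp1252_ok = False

# "utf-8-sig" writes UTF-8 with a leading byte order mark (EF BB BF).
path = os.path.join(tempfile.gettempdir(), "export_utf8_bom.csv")  # hypothetical name
with open(path, "w", encoding="utf-8-sig", newline="") as f:
    csv.writer(f).writerows(rows)

# Read it back: once as raw bytes, once as decoded CSV rows.
with open(path, "rb") as f:
    raw = f.read()
with open(path, "r", encoding="utf-8-sig", newline="") as f:
    round_trip = list(csv.reader(f))
```

The file stays a plain CSV, so the integration step would not need to change format, only to read the file as UTF-8.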

Regex JSon Quote Cleaner

I've got some JSON coming in that is occasionally dirty with random quotes inserted. As an example :
"contact_interests": "Interests:|Poet - Several poems have been published. One poem was set to music, recorded and released in 1972. A recent poem |"Little Brother|" has been set to music and will be recorded and released by 2014.Read |"mystery books, love long walks/hikes, prayer, family.|",
We need to find and replace all occurrences of |" except where those characters terminate a line (|",)
What would be the Regex to accomplish this? Thanks.
You can try this:
\|"(?=[^,])
It will match |" that isn't followed by a comma
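A short Python sketch of applying that pattern with re.sub. The sample string is a shortened version of the one in the question, and replacing with an empty string is an assumption, since the question does not say what the replacement should be.

```python
import re

# Matches |" only when the next character exists and is not a comma.
STRAY_QUOTE = re.compile(r'\|"(?=[^,])')

sample = 'A recent poem |"Little Brother|" has been set to music.|",'

# Drop the stray |" sequences; the terminating |", is left untouched.
cleaned = STRAY_QUOTE.sub("", sample)
print(cleaned)
```

Note that the lookahead needs a following character, so a |" at the very end of the input (with nothing after it) is also left alone.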

creating a list from json csv file using python

I am sorry for asking this question, but I have already looked around and could not find the answer. I am honestly a newbie. I am trying to generate a list of all the words in a JSON CSV file. I have already created a list of lines, but I cannot use split() to generate a new list containing the separate words (later I need to count word occurrences).
My input file contains twitter information:
twitter data
I tried to write some simple code:
myfile = open('fileName', 'r')
words = []
for line in myfile:
    # each entry appended here is a list of the words on one line
    words.append(line.split())
print(len(words))  # gives 82: one entry per line, not per word
I also tried reader=csv.reader(myFile) and reader=csv.DictReader(myFile),
but in every case I only get each line. How can I further split each string/line into individual words? Sorry, and thank you in advance.
My data (I changed to a different example, as the last one may have been badly formatted):
id,flags,expiration,cas,value
493926581610364928,0,0,2635740904247446,"{""contributors"":null,""truncated"":false,""text"":""#xaaronh #blueredandgold If Namco Bandai's One Piece Unlimited World is anything to go by, no local retail release means no eShop either =\\"",""in_reply_to_status_id"":493925918998425600,""id"":493926581610364928,""favorite_count"":0,""source"":""Twitter Web Client"",""retweeted"":false,""coordinates"":null,""entities"":{""symbols"":[],""user_mentions"":[{""id"":139852376,""indices"":[0,8],""id_str"":""139852376"",""screen_name"":""xaaronh"",""name"":""Aaron""},{""id"":74393990,""indices"":[9,24],""id_str"":""74393990"",""screen_name"":""blueredandgold"",""name"":""Leigh""}],""hashtags"":[],""urls"":[]},""in_reply_to_screen_name"":""xaaronh"",""in_reply_to_user_id"":139852376,""retweet_count"":0,""id_str"":""493926581610364928"",""favorited"":false,""user"":{""follow_request_sent"":false,""profile_use_background_image"":true,""default_profile_image"":false,""id"":42302246,""profile_background_image_url_hp"":""hp://pbs.twimg.com/profile_background_images/464279459932020736/v1xnMcrV.jpeg"",""verified"":false,""profile_text_color"":""333333"",""profile_image_url_https"":""hp://pbs.twimg.com/profile_images/490791031487463424/udSldTQ3_normal.png"",""profile_sidebar_fill_color"":""DDEEF6"",""entities"":{""description"":{""urls"":[{""url"":""hp:tttt"",""indices"":[67,89],""expanded_url"":""hp://infernalmonkey.com"",""display_url"":""infernalmonkey.com""}]}},""followers_count"":506,""profile_sidebar_border_color"":""000000"",""id_str"":""42302246"",""profile_background_color"":""1A1B1F"",""listed_count"":22,""is_translation_enabled"":false,""utc_offset"":36000,""statuses_count"":8676,""description"":""I probably tweet about video games and onaholes. Let's be friends! 
(NSFW)"",""friends_count"":261,""location"":""Sydney, Australia"",""profile_link_color"":""2FC2EF"",""profile_image_url"":""hp://pbs.twimg.com/profile_images/490791031487463424/udSldTQ3_normal.png"",""following"":false,""geo_enabled"":false,""profile_banner_url"":""hp://pbs.twimg.com/profile_banners/42302246/1406105444"",""profile_background_image_url"":""hp://pbs.twimg.com/profile_background_images/464279459932020736/v1xnMcrV.jpeg"",""screen_name"":""infernal_monkey"",""lang"":""en"",""profile_background_tile"":false,""favourites_count"":2018,""name"":""Lance McGill"",""notifications"":false,""url"":null,""created_at"":""Sun May 24 23:20:25 +0000 2009"",""contributors_enabled"":false,""time_zone"":""Sydney"",""protected"":false,""default_profile"":false,""is_translator"":false},""geo"":null,""in_reply_to_user_id_str"":""139852376"",""lang"":""en"",""_id"":""493926581610364928"",""created_at"":""Tue Jul 29 01:10:48 +0000 2014"",""in_reply_to_status_id_str"":""493925918998425600"",""place"":null,""metadata"":{""iso_language_code"":""en"",""result_type"":""recent""}}"
This is not the best solution, just an effort from a noob (me); it definitely needs further editing for better output. I am using Windows.
import csv
import json

myList = []
# newline='' lets the csv module handle line breaks inside quoted fields
with open('fileName.csv', 'r', encoding='utf-8', newline='') as myFile:
    myReader = csv.reader(myFile)
    header = next(myReader)  # skip the header row
    for line in myReader:
        # the fifth column ("value") holds the tweet as a JSON string
        myDict = json.loads(line[4])
        myList.append(myDict['text'])

# count how often each whitespace-separated word occurs
dct = {}
for eachLine in myList:
    for one in eachLine.split():
        if one in dct:
            dct[one] += 1
        else:
            dct[one] = 1

finalList = list(dct.items())
finalList.sort()
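The counting step can also be handed to collections.Counter from the standard library. The sketch below is one way to do that, not a replacement for the approach above; the function name count_words and the two-row in-memory sample are made up so the example runs without a file.

```python
import csv
import io
import json
from collections import Counter


def count_words(csv_file, column=4, key="text"):
    """Count whitespace-separated words in one JSON field of one CSV column."""
    reader = csv.reader(csv_file)
    next(reader)  # skip the header row
    counts = Counter()
    for row in reader:
        tweet = json.loads(row[column])
        counts.update(tweet[key].split())
    return counts


# Tiny stand-in for the real file, using the same five-column layout.
sample = io.StringIO(
    'id,flags,expiration,cas,value\n'
    '1,0,0,10,"{""text"":""no local release means no eShop""}"\n'
    '2,0,0,11,"{""text"":""no release""}"\n'
)
word_counts = count_words(sample)
print(word_counts.most_common(2))
```

With a real file, open it with encoding='utf-8' and newline='' and pass the file object in place of sample.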

Parsing csv file with vim

I have a large CSV file structured as follows:
CHINESE TRANSLATION
我去上学。 Wǒ qù shàngxué. I am going to school. 上 ♦ on, on top of ♦ go to
我去过北京。 Wǒ qùguò Běijīng. I've been to Beijing. 京 -- ♦ national capital ♦ Beijing
....
The TRANSLATION column blends together three different kinds of information: the pinyin, the English translation and additional information. These three parts are always present, always appear in the same order, and are separated by a dot.
What I want to achieve is to create three different columns from the TRANSLATION column, i.e. to get:
CHINESE PINYIN TRANSLATION ADDITIONAL
我去上学。 Wǒ qù shàngxué. I am going to school. 上 ♦ on, on top of ♦ go to
....
Using a vim macro, how can I do this ?
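I think vim macros can handle this job, but executing a vim macro several thousand times on a big file is very slow. So if you just want the job done, I have written a Python script that I think gives you what you want.
import csv

# change 'in.csv' and 'out.csv'
# to your exact file names.
with open('in.csv', 'r', encoding='utf-8', newline='') as infile, \
        open('out.csv', 'w', encoding='utf-8', newline='') as outfile:
    csvreader = csv.reader(infile)
    csvwriter = csv.writer(outfile)
    for a, b in csvreader:
        # split TRANSLATION on the first two dots only: pinyin, English, additional
        parts = [part.strip() for part in b.split('.', 2)]
        # csv.writer quotes fields containing commas and ends each row with a newline
        csvwriter.writerow([a] + parts)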

Filter only in one element/array of Twitter JSON file

I crawled Twitter JSON data from the Streaming API and got a file with thousands of lines of JSON. However, this data contains lots of elements such as "creation date", "source", "tweet text", etc.
I actually want to filter for the word "iphone" in the tweet text. However, if I filter using Unix grep, it matches not only the "tweet text" field but also the "source" field. So a tweet that does not contain the word "iphone" but was sent from Twitter for iPhone, as stated in the "source" field, is matched as well.
Is there any way to filter this JSON on one specific field only (in my case the "tweet text" field)?
Here's the example of one JSON line:
{"created_at":"Tue Aug 20 03:48:27 +0000 2013","id":369667218608369666,"id_str":"369667218608369666","text":"#Mattyb_chyeah_ yeah I'm only watching him! :)","source":"\u003ca href=\"http:\/\/twitter.com\/download\/iphone\" rel=\"nofollow\"\u003eTwitter for iPhone\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":369666992334073856,"in_reply_to_status_id_str":"369666992334073856","in_reply_to_user_id":1557571363,"in_reply_to_user_id_str":"1557571363","in_reply_to_screen_name":"Mattyb_chyeah_","user":{"id":1325959333,"id_str":"1325959333","name":"MattyBRapsTexas","screen_name":"MattyBRapsTexas","location":"Atlanta,Georgia","url":"http:\/\/www.instagram.com\/mattybrapstexas","description":"3 RT 6 Mentions He followed me on 4\/15\/13 6\/17\/13 Maddi Jane followed me on 6\/18\/13 #8:25pm! Cimorelli also follows Pizza Hut mentioned me 2 times on 7\/26\/13","protected":false,"followers_count":1095,"friends_count":426,"listed_count":8,"created_at":"Thu Apr 04 02:34:56 +0000 2013","favourites_count":226,"utc_offset":-14400,"time_zone":"Eastern Time (US & 
Canada)","geo_enabled":false,"verified":false,"statuses_count":3447,"lang":"en","contributors_enabled":false,"is_translator":false,"profile_background_color":"C0DEED","profile_background_image_url":"http:\/\/a0.twimg.com\/images\/themes\/theme1\/bg.png","profile_background_image_url_https":"https:\/\/si0.twimg.com\/images\/themes\/theme1\/bg.png","profile_background_tile":false,"profile_image_url":"http:\/\/a0.twimg.com\/profile_images\/378800000313651225\/afee0cc2286882eeb15f21ed7fae334a_normal.jpeg","profile_image_url_https":"https:\/\/si0.twimg.com\/profile_images\/378800000313651225\/afee0cc2286882eeb15f21ed7fae334a_normal.jpeg","profile_banner_url":"https:\/\/pbs.twimg.com\/profile_banners\/1325959333\/1376759786","profile_link_color":"0084B4","profile_sidebar_border_color":"C0DEED","profile_sidebar_fill_color":"DDEEF6","profile_text_color":"333333","profile_use_background_image":true,"default_profile":true,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":null,"coordinates":null,"place":null,"contributors":null,"retweet_count":0,"favorite_count":0,"entities":{"hashtags":[],"symbols":[],"urls":[],"user_mentions":[{"screen_name":"Mattyb_chyeah_","name":"MattyB (\u2661_\u2661\u2740)","id":1557571363,"id_str":"1557571363","indices":[0,15]}]},"favorited":false,"retweeted":false,"filter_level":"medium","lang":"en"
What are you using for your grep regex? If you are just using 'iphone' for the regex then yes, you'll get multiple hits. You can expand your regex to match iphone only in the text section, which comes before the source:
grep '"text":".*iphone.*","source":' myfile.txt
will search for the pattern iphone after "text" but before "source". It will ignore iphone in the rest of the line.
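If a script is an option, parsing each line as JSON avoids relying on "text" being followed directly by "source". A minimal Python sketch, assuming one JSON object per line; the function name filter_by_text and the two sample tweets are invented for illustration, and the match is case-insensitive by choice.

```python
import json


def filter_by_text(lines, word):
    """Return the parsed tweets whose "text" field contains word (case-insensitive)."""
    word = word.lower()
    matches = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            tweet = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or otherwise invalid lines
        if not isinstance(tweet, dict):
            continue
        # Only the "text" field is inspected; "source" is ignored.
        if word in (tweet.get("text") or "").lower():
            matches.append(tweet)
    return matches


sample_lines = [
    '{"id": 1, "text": "yeah I am only watching him! :)", "source": "Twitter for iPhone"}',
    '{"id": 2, "text": "my new iPhone arrived", "source": "Twitter Web Client"}',
]
hits = filter_by_text(sample_lines, "iphone")
print([tweet["id"] for tweet in hits])
```

For a real file, pass the open file object as lines, for example filter_by_text(open('myfile.txt', encoding='utf-8'), 'iphone').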