I'm sorry for asking this question, but I already looked through the site and could not find the answer; I'm honestly a newbie. I'm trying to generate a list of whole words from a CSV file whose rows contain JSON. I already created a list of lines, but I cannot use split() to generate a new list containing the separate words (later I need to count word occurrences).
My input file contains Twitter data.
I tried to write some simple code:
myfile = open('fileName', 'r')
words = []
for line in myfile:
    words.append(line.split())
# len(words) == 82 -- one entry per line, not one per word
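(Note: words ends up as a list of lists here, because line.split() returns a list and append stores it whole. Using extend instead would give the flat word list I was after:)

words = []
for line in myfile:
    words.extend(line.split())  # extend flattens: one flat list of words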
I also tried reader=csv.reader(myFile) and reader=csv.DictReader(myFile)
but in each case I can get the individual lines; how do I further split each string/line into independent words? Sorry, and thank you in advance.
My data (I changed to a different example, as maybe the last one was badly formatted):
id,flags,expiration,cas,value
493926581610364928,0,0,2635740904247446,"{""contributors"":null,""truncated"":false,""text"":""#xaaronh #blueredandgold If Namco Bandai's One Piece Unlimited World is anything to go by, no local retail release means no eShop either =\\"",""in_reply_to_status_id"":493925918998425600,""id"":493926581610364928,""favorite_count"":0,""source"":""Twitter Web Client"",""retweeted"":false,""coordinates"":null,""entities"":{""symbols"":[],""user_mentions"":[{""id"":139852376,""indices"":[0,8],""id_str"":""139852376"",""screen_name"":""xaaronh"",""name"":""Aaron""},{""id"":74393990,""indices"":[9,24],""id_str"":""74393990"",""screen_name"":""blueredandgold"",""name"":""Leigh""}],""hashtags"":[],""urls"":[]},""in_reply_to_screen_name"":""xaaronh"",""in_reply_to_user_id"":139852376,""retweet_count"":0,""id_str"":""493926581610364928"",""favorited"":false,""user"":{""follow_request_sent"":false,""profile_use_background_image"":true,""default_profile_image"":false,""id"":42302246,""profile_background_image_url_hp"":""hp://pbs.twimg.com/profile_background_images/464279459932020736/v1xnMcrV.jpeg"",""verified"":false,""profile_text_color"":""333333"",""profile_image_url_https"":""hp://pbs.twimg.com/profile_images/490791031487463424/udSldTQ3_normal.png"",""profile_sidebar_fill_color"":""DDEEF6"",""entities"":{""description"":{""urls"":[{""url"":""hp:tttt"",""indices"":[67,89],""expanded_url"":""hp://infernalmonkey.com"",""display_url"":""infernalmonkey.com""}]}},""followers_count"":506,""profile_sidebar_border_color"":""000000"",""id_str"":""42302246"",""profile_background_color"":""1A1B1F"",""listed_count"":22,""is_translation_enabled"":false,""utc_offset"":36000,""statuses_count"":8676,""description"":""I probably tweet about video games and onaholes. Let's be friends! (NSFW)"",""friends_count"":261,""location"":""Sydney, Australia"",""profile_link_color"":""2FC2EF"",""profile_image_url"":""hp://pbs.twimg.com/profile_images/490791031487463424/udSldTQ3_normal.png"",""following"":false,""geo_enabled"":false,""profile_banner_url"":""hp://pbs.twimg.com/profile_banners/42302246/1406105444"",""profile_background_image_url"":""hp://pbs.twimg.com/profile_background_images/464279459932020736/v1xnMcrV.jpeg"",""screen_name"":""infernal_monkey"",""lang"":""en"",""profile_background_tile"":false,""favourites_count"":2018,""name"":""Lance McGill"",""notifications"":false,""url"":null,""created_at"":""Sun May 24 23:20:25 +0000 2009"",""contributors_enabled"":false,""time_zone"":""Sydney"",""protected"":false,""default_profile"":false,""is_translator"":false},""geo"":null,""in_reply_to_user_id_str"":""139852376"",""lang"":""en"",""_id"":""493926581610364928"",""created_at"":""Tue Jul 29 01:10:48 +0000 2014"",""in_reply_to_status_id_str"":""493925918998425600"",""place"":null,""metadata"":{""iso_language_code"":""en"",""result_type"":""recent""}}"
This is not the best solution, just an effort from a noob (me); it definitely needs further editing for better output. I am using Windows.
import csv
import json

myList = []
myFile = open('fileName.csv', 'r', encoding='utf-8')
myReader = csv.reader(myFile)
header = next(myReader)  # skip the header row
for line in myReader:
    myDict = json.loads(line[4])  # the fifth column holds the tweet JSON
    myList.append(myDict['text'])
myFile.close()

dct = {}
for eachLine in myList:
    item = eachLine.split()
    for one in item:
        if one in dct:
            dct[one] += 1
        else:
            dct[one] = 1
finalList = list(dct.items())
finalList.sort()
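For reference, a more compact sketch of the same idea using collections.Counter (assuming the same layout, with the tweet JSON in the fifth column):

import csv
import json
from collections import Counter

counts = Counter()
with open('fileName.csv', 'r', encoding='utf-8') as myFile:
    myReader = csv.reader(myFile)
    next(myReader)  # skip the header row
    for line in myReader:
        tweet = json.loads(line[4])  # parse the JSON stored in the "value" column
        counts.update(tweet['text'].split())

print(counts.most_common(10))  # the ten most frequent words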
/////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
This one's difficult and I haven't found any answers for hours; I hope you can help me. I'm not a native English speaker, so I apologize in advance.
I started at a company last week and am working with .json files, which are all stored in directories, one per company.
e.g.
d/Company1/comport/enregistrement_sessionhash1
enregistrement_sessionhash2
enregistrement_sessionhash3
d/Company2/comport/enregistrement_sessionhashX
d/Company3/comport/enregistrement_sessionhashY...
Each of them can contain [0-n] characters.
We use these files for data calculations.
The person before me didn't think to organize them by /year/month, so running algorithms on a specific month's data takes a lot of time: everything in the directory has to be read, and files have been stored every 10 seconds per website company and per website user for approximately 2 years.
Sadly, we can't use the filesystem's creation/modification times, only the text information in the .json files: there was a server problem and my coworkers had to re-paste the files, which reset the creation times.
Here is a template of the .json files
BEGIN OF FILE
{"session":"session_hash","enregistrements":[{"session":"session_hash",[...]{"data2":"xxx"}],"timedate_saved":"27 04 2020 12:39:21"},{"session":"session_hash",[...],"timedate_saved":"17 06 2020 11:01:08"},{"data1":"session_hash"[...],{"data2":"xxx"}],"timedate_saved":"27 04 2020 18:01:14"}]}
END OF FILE
A file never contains more than one "session" value. This value is a hash, which is also used in the filename, e.g. d/Company1/comport/enregistrement_session_hash
I would like to read the files and cut out every "enregistrements" sub-array (starting with [{"session"... and ending with "timedate_saved":"01 01 1970 00:00:00"}]}). The cut-out text should then be written to files with the same filename (session_hash), stored under company/comport/year/months/enregistrement_sessionhash, where year and month are taken from the "timedate_saved" data. And of course I'd like to be able to reuse these files afterwards, so the .json structure must be preserved.
That's a lot; I hope someone has time on their hands to help me get through it.
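A minimal sketch of that reorganization in Python, assuming the "DD MM YYYY HH:MM:SS" timestamp format shown in the template above (the function name and paths are placeholders for illustration):

import json
import os
from collections import defaultdict
from datetime import datetime

def split_by_month(src_path, dest_root, company):
    # load one recording file; "session" and "enregistrements" follow the template
    with open(src_path, encoding='utf-8') as f:
        data = json.load(f)
    session = data["session"]

    # group the sub-arrays by the year/month of their "timedate_saved" field
    by_month = defaultdict(list)
    for rec in data["enregistrements"]:
        ts = datetime.strptime(rec["timedate_saved"], "%d %m %Y %H:%M:%S")
        by_month[(ts.year, ts.month)].append(rec)

    # write each group under company/comport/year/month/enregistrement_sessionhash
    for (year, month), recs in by_month.items():
        out_dir = os.path.join(dest_root, company, "comport", str(year), f"{month:02d}")
        os.makedirs(out_dir, exist_ok=True)
        out_path = os.path.join(out_dir, "enregistrement_" + session)
        with open(out_path, "w", encoding="utf-8") as f:
            json.dump({"session": session, "enregistrements": recs}, f)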
I'm trying to read a log file in R.
It looks like an extract from a JSON file to me, but when trying to read it using jsonlite I get the following error message: "Error: parse error: trailing garbage".
Here is what my log file looks like:
{"date":"2017-05-11T04:37:15.587Z","userId":"admin","module":"Quote","action":"CreateQuote","identifier":"-.admin1002"},
{"date":"2017-05-11T05:12:24.939Z","userId":"a145fhyy","module":"Quote","action":"Call","identifier":"RunUY"},
{"date":"2017-05-11T05:12:28.174Z","userId":"a145fhyy","license":"named","usage":"External","module":"Catalog","action":"OpenCatalog","identifier":"wks.klu"},
As you can see, the column name is given directly in front of the content on each line (e.g. "date": or "action":).
Also, some lines skip some columns and add others.
What I want as output is a table with 7 columns, each filled with the corresponding data:
date
userId
license
usage
module
action
identifier
Does anyone have a suggestion about how to get there?
Thanks a lot in advance.
/////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
Thanks everyone for your answers. Here are some clarifications about my issue:
The data I gave as an example is an extract from one of my log files. I have a lot of them, and I need to read them all as one single table.
I haven't added any commas or anything else to it.
#r2evans
I've tried the following:
Log3 <- read.table("/Projects/data/analytics.log.agregated.2017-05-11.log")
jsonlite::stream_in(textConnection(gsub(",$", "", Log3)))
It returns the following error:
Error: lexical error: invalid char in json text.
c(17, 18, 19, 20, 21, 22, 23, 2
(right here) ------^
I'm not sure how to use sed -e 's/,$//g' infile > outfile and Sys.which("sed"); that's something I'm not familiar with. I'm looking into it, but if you have any more details about how to use them, that would be great.
I have saved your example as a file "test.json" and was able to read and parse it like this:
library(jsonlite)
library(readr)  # read_file() comes from readr, not jsonlite

rf <- read_file("test.json")
rfr <- gsub("\\},", "\\}", rf)  # drop the comma after each closing brace
data <- stream_in(textConnection(rfr))
It parses and simplifies into a neat data frame exactly as you want. What I do is look for "}," rather than ",$", because the very last comma is not (necessarily) followed by a newline character.
However, this might not be the best solution for very large files. For those you may need to first look for a way to modify the text file itself by getting rid of the commas. Or, if that's possible, ask the people who exported this file to export it in a proper ndjson format :-)
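If sed isn't available, the same streaming edit is easy to sketch in Python (the infile/outfile names are placeholders); it strips one trailing comma per line without loading the whole file into memory:

# strip a trailing comma from each line, streaming so large files are fine
with open("infile.log") as src, open("outfile.ndjson", "w") as dst:
    for line in src:
        dst.write(line.rstrip("\n").rstrip(",") + "\n")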
I am using a simple program to read a CSV file, and I noticed that when I create the CSV with Excel on a Windows machine, my Go library fails to read it; even the cat command only shows me the last line in the terminal. It always results in this error: extraneous " in field.
After some research I found it is related to carriage-return differences between operating systems.
But I really want to ask how to make a generic CSV reader. I tried reading the same CSV using pandas and it was read successfully, but I have not been able to achieve this with my Go code.
A screenshot of the correct CSV is attached here.
Your file clearly shows that you've got an extra quote at the end of the content. While programs like pandas may be fine with that, I assume it's not valid CSV, so Go returns an error.
Quick example of what's wrong with your data: https://play.golang.org/p/KBikSc1nzD
Update: after your update and a little bit of searching, I have to apologize; the carriage return does matter and seems to be the main culprit here. Go seems to be OK handling the Windows \r\n variant but not the bare \r one. In that case, what you can do is wrap the bytes.Reader in a custom reader that replaces the \r byte with the \n byte.
Here's an example: https://play.golang.org/p/vNjzwAHmtg
Please note that the example is just that, an example; it's not handling all the possible cases where \r might be a legitimate byte.
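For readers not working in Go, the same normalization is straightforward to sketch in Python (the filename is a placeholder): read the bytes, convert "\r\n" first so Windows files are untouched, then turn any bare "\r" into "\n" before parsing:

import csv

with open("data.csv", "rb") as f:
    raw = f.read()
# normalize "\r\n" first, then bare "\r" (old Mac/Excel line endings)
text = raw.replace(b"\r\n", b"\n").replace(b"\r", b"\n").decode("utf-8")
rows = list(csv.reader(text.splitlines()))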
I'm currently working on a data-mining project in Weka, using a dataset I found on my own. The only issue is that converting my file from CSV format into ARFF format is causing problems.
java.io.IOException: wrong number of values. Read 2, expected 5, Read Token[EOL], line 3
This is the error I am getting. I have browsed around online looking for similar issues and have tried removing all quotes and special characters that throw this exception; every place I looked told me to remove special characters, and I believe there are none left. The link to my dataset is here: https://docs.google.com/spreadsheets/d/1xqEe7MZE9SdKB_yvFSgWeSVYuDrq0b31Eu5oECNbGH0/edit#gid=1736568367&vpid=A1
These are the first three lines of my file; the first line holds the attribute names, and the file is comma-separated (viewed in Notepad):
Inequality Adjusted HPI Rank,Sub Region,Inequality Adjusted Life Expectancy,Inquality Adjusted Well being,Footprint
,Inequality adjusted HPI
1,1,73.1,6.9,2.5,48.2
2,6,65.17333333,5.487667631,1.390974448,45.97489063
If you open your file with a text editor, you will see that Footprint has quotes around it. Delete the quotes and you are good to go!
Weka is normally not that good at reading CSV files that include special characters, and ARFF files are normally easier to use. Therefore, in such cases, the easiest way is to convert your CSV file to an ARFF file using R (the "RWeka" and "foreign" packages can handle this conversion).
There is also another possibility: when I was creating my CSV file, the header had a different number of elements compared to the rest of the data. So, check the header as well...!
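A quick way to spot that kind of mismatch before handing the file to Weka is to count the fields in every row; here's a sketch in Python (the filename is a placeholder):

import csv

with open("dataset.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.reader(f))

expected = len(rows[0])  # the header row defines the expected field count
for lineno, row in enumerate(rows[1:], start=2):
    if len(row) != expected:
        print(f"line {lineno}: {len(row)} values, expected {expected}")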
Firstly, I'm just learning Python (which is my first language), so while I recognise there are numerous websites that address this, I've spent a weekend trying to get my head around implementing a solution and got nowhere with it. So, I'm hoping someone here can help me :)
The problem is simple: I've created a list of lists in a Python program, and I need to output them to a *.csv file so I can import it into Excel etc.
The list looks like this:
[['title1','title2','title3'],['date1','info1','category1'],['date2','info2','category3'],...]
I've found solutions that work when the elements in each list are integers, but I can't get them to work with strings.
Any help on this would be much appreciated!
Thanks,
Adam
Python's built-in csv module can do this:
import csv

data = [['title1','title2','title3'],['date1','info1','category1'],['date2','info2','category3']]

# newline='' stops the csv module adding blank rows on Windows
# (on Python 2, open with 'wb' instead)
with open('stuff.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    for line in data:
        writer.writerow(line)
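As a side note, writer.writerows(data) writes the whole list of lists in one call, and the csv module quotes strings automatically when needed, so string elements require no special handling.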