I have this table of a chat conversation, mydf:
mydf = structure(list(User = c("Ana", "Ana", "Brian", "Ana", "Brian"), Message = c("Hi",
"How are you?", "Good. You?", "Ok", "What's up?"), Time = structure(c(1512156236.17704,
1512156238.67704, 1512156241.17704, 1512156243.67704, 1512156246.17704
), class = c("POSIXct", "POSIXt"))), .Names = c("User", "Message",
"Time"), row.names = c(NA, -5L), class = "data.frame")
#> mydf
# User Message Time
#1 Ana Hi 2017-12-01 13:23:56
#2 Ana How are you? 2017-12-01 13:23:58
#3 Brian Good. You? 2017-12-01 13:24:01
#4 Ana Ok 2017-12-01 13:24:03
#5 Brian What's up? 2017-12-01 13:24:06
My goal is to convert this data into a conversation format in HTML. Currently I do it by adding tags to the data and saving the result; afterwards I still have to do some CSS work to make it look better. Is there an easier way in R?
#REMOVE REPEATING NAMES
mydf$User = with(rle(mydf$User), unlist(sapply(seq_along(values),
  function(i) c(values[i], rep("", lengths[i] - 1)))))
#ADD TAGS
mydf$User = ifelse(mydf$User == "", "", paste0("<h2 class=\"user\">", mydf$User, "</h2>"))
mydf$Message = paste0("<h3 class=\"msg\">", mydf$Message, "</h3>")
mydf$Time = paste0("<span class=\"tm\">", mydf$Time, "</span>")
#SAVE HTML
writeLines(paste(paste(mydf$User, mydf$Message, mydf$Time), collapse = "\n"), "~/test.html")
I am not sure whether this satisfies you, but I would go with an approach that extends the data.frame and writes it directly to the file:
#REMOVE REPEATING NAMES
mydf$User = with(rle(mydf$User), unlist(sapply(seq_along(values),
  function(i) c(values[i], rep("", lengths[i] - 1)))))
#SAVE HTML
write.table(
  data.frame(
    ifelse(mydf$User != "", "<h2 class=\"user\">", ""), mydf$User, ifelse(mydf$User != "", "</h2>", ""),
    "<h3 class=\"msg\">", mydf$Message, "</h3>",
    "<span class=\"tm\">", mydf$Time, "</span>"),
  file = "~/test.html", quote = FALSE, col.names = FALSE, row.names = FALSE)
I'm trying to get information from Wikidata. For example, to access "cobalt-70" I use the API:
API_ENDPOINT = "https://www.wikidata.org/w/api.php"
query = "cobalt-70"
params = {
'action': 'wbsearchentities',
'format': 'json',
'language': 'en',
'search': query
}
r = requests.get(API_ENDPOINT, params = params)
print(r.json())
So there is a "claims" which gives access to the statements. Is there a best way to check if a value exists in the statement? For example, "cobalt-70" have the value 0.5 inside the property P2114. So how can I check if a value exists in the statement of the entity? As this example.
Is there an approach to access it. Thank you!
I'm not sure this is exactly what you are looking for, but if it's close enough, you can probably modify it as necessary:
import requests

url = 'https://www.wikidata.org/wiki/Special:EntityData/Q18844865.json'
req = requests.get(url)
j_dat = req.json()

targets = j_dat['entities']['Q18844865']['claims']['P2114']
for target in targets:
    values = target['mainsnak']['datavalue']['value'].items()
    for value in values:
        print(value[0], value[1])
Output:
amount +0.5
unit http://www.wikidata.org/entity/Q11574
upperBound +0.6799999999999999
lowerBound +0.32
amount +108.0
unit http://www.wikidata.org/entity/Q723733
upperBound +115.0
lowerBound +101.0
EDIT:
To find property id by value, try:
targets = j_dat['entities']['Q18844865']['claims'].items()
for target in targets:
    line = target[1][0]['mainsnak']['datavalue']['value']
    if isinstance(line, dict):
        for v in line.values():
            if v == "+0.5":
                print('property: ', target[0])
Output:
property: P2114
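Note that the entity id hard-coded above (Q18844865) could be looked up with the wbsearchentities call from the question instead; a minimal sketch, assuming the first search hit is the entity you want:
import requests

API_ENDPOINT = "https://www.wikidata.org/w/api.php"
params = {
    'action': 'wbsearchentities',
    'format': 'json',
    'language': 'en',
    'search': 'cobalt-70'
}
r = requests.get(API_ENDPOINT, params=params)
# each search hit carries an 'id' field such as 'Q18844865';
# taking the first hit is an assumption, not guaranteed to be right
entity_id = r.json()['search'][0]['id']
url = 'https://www.wikidata.org/wiki/Special:EntityData/{}.json'.format(entity_id)
j_dat = requests.get(url).json()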
I tried a solution that consists of searching inside the JSON object, as proposed here: https://stackoverflow.com/a/55549654/8374738. I hope it can help; the idea is as follows.
import pprint

def search(d, search_pattern, prev_datapoint_path=''):
    output = []
    current_datapoint = d
    current_datapoint_path = prev_datapoint_path
    if type(current_datapoint) is dict:
        for dkey in current_datapoint:
            if search_pattern in str(dkey):
                c = current_datapoint_path
                c += "['" + dkey + "']"
                output.append(c)
            c = current_datapoint_path
            c += "['" + dkey + "']"
            for i in search(current_datapoint[dkey], search_pattern, c):
                output.append(i)
    elif type(current_datapoint) is list:
        for i in range(0, len(current_datapoint)):
            if search_pattern in str(i):
                c = current_datapoint_path
                c += "[" + str(i) + "]"
                output.append(c)
            c = current_datapoint_path
            c += "[" + str(i) + "]"
            for i in search(current_datapoint[i], search_pattern, c):
                output.append(i)
    elif search_pattern in str(current_datapoint):
        c = current_datapoint_path
        output.append(c)
    output = filter(None, output)
    return list(output)
And you just need to use (res here being the requests response whose JSON holds the claims):
pprint.pprint(search(res.json(), '0.5', 'res.json()'))
Output:
["res.json()['claims']['P2114'][0]['mainsnak']['datavalue']['value']['amount']"]
I am a beginner, so I can't figure out the reason for the error in the following code when train.jsonl uses a format like this:
{"claim": "But he said if people really want to know if they have CHIP they can get a blood test that costs a few MONEYc1", "evidence": "sentenceID100037", "label": "0"}
{"claim": "This is rather a courtly formulation and would doubtless trigger further eyerolling if uttered in", "evidence": "sentenceID100038", "label": "0"}
The top part executes without problem and displays the data.
import pandas as pd
prefix = '/content/'
train_df = pd.read_json(prefix + 'train.jsonl', orient='records', lines=True)
train_df.head()
See my Colab Notebook: https://colab.research.google.com/gist/lenyabloko/0e17ebe0f3a0e808779bc1fa95e9b24d/semeval2020-delex.ipynb
I even tried this additional trick, which is what the comments below about the 0 column refer to:
prefix = '/content/'
train_df = pd.read_json(prefix + 'train_delex.jsonl', orient='columns')
train_df.to_csv(prefix+'train.tsv', sep='\t', index=False, header=False)
train_df = pd.read_csv(prefix + 'train.tsv', header=None)
train_df.head()
Now I see a column labeled '0' instead of the original three columns {"claim": "...", "evidence": "...", "label": "..."} from the JSONL file above (why is that?).
But when I add the following DataFrame code, it results in an error:
train_df = pd.DataFrame({
    'id': train_df[1],
    'text': train_df[0],
    'labels': train_df[2]
})
In light of the column named '0', this wouldn't work. But where did that column come from?
KeyError Traceback (most recent call last)
2 frames
<ipython-input-16-0537eda6b397> in <module>()
6
7 train_df = pd.DataFrame({
----> 8 'id': train_df[1],
9 'text': train_df[0],
10 'labels':train_df[2]
/usr/local/lib/python3.6/dist-packages/pandas/core/frame.py in __getitem__(self, key)
2993 if self.columns.nlevels > 1:
2994 return self._getitem_multilevel(key)
-> 2995 indexer = self.columns.get_loc(key)
2996 if is_integer(indexer):
2997 indexer = [indexer]
/usr/local/lib/python3.6/dist-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
2897 return self._engine.get_loc(key)
2898 except KeyError:
-> 2899 return self._engine.get_loc(self._maybe_cast_indexer(key))
2900 indexer = self.get_indexer([key], method=method, tolerance=tolerance)
2901 if indexer.ndim > 1 or indexer.size > 1:
Here is the solution that worked for me:
import pandas as pd

prefix = '/content/'
test_df = pd.read_json(prefix + 'test_delex.jsonl', orient='records', lines=True)
test_df.rename(columns={'claim': 'text', 'evidence': 'id', 'label': 'labels'}, inplace=True)
# rotate the column order right twice: ['text', 'id', 'labels'] -> ['id', 'labels', 'text']
cols = test_df.columns.tolist()
cols = cols[-1:] + cols[:-1]
cols = cols[-1:] + cols[:-1]
test_df = test_df[cols]
test_df.to_csv(prefix + 'test.csv', sep=',', index=False, header=False)
test_df.head()
I updated my shared Colab Notebook linked in the question above.
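For reference, the same rename-and-reorder can be written without rotating the column list twice; a minimal sketch with the order ['id', 'labels', 'text'] spelled out (that is what the double rotation produces):
import pandas as pd

prefix = '/content/'
train_df = pd.read_json(prefix + 'train.jsonl', orient='records', lines=True)
# rename the JSONL fields, then pick the column order explicitly
train_df = train_df.rename(columns={'claim': 'text', 'evidence': 'id', 'label': 'labels'})
train_df = train_df[['id', 'labels', 'text']]
train_df.head()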
I was able to take a text file, read each line, create a dictionary per line, update (append) each dictionary, and store it in a JSON file. The issue is that when reading the JSON file back, it will not read correctly. Does the error point to a problem with how the file is stored?
The text file looks like:
84.txt; Frankenstein, or the Modern Prometheus; Mary Wollstonecraft (Godwin) Shelley
98.txt; A Tale of Two Cities; Charles Dickens
...
import json
import re

path = "C:\\...\\data\\"
books = {}
books_json = {}
final_book_json = {}

file = open(path + 'books\\set_of_books.txt', 'r')
json_list = file.readlines()
open(path + 'books\\books_json.json', 'w').close()  # used to clean each test

json_create = []
i = 0
for line in json_list:
    line = line.replace('#', '')
    line = line.replace('.txt', '')
    line = line.replace('\n', '')
    line = line.split(';', 4)
    BookNumber = line[0]
    BookTitle = line[1]
    AuthorName = line[-1]
    if BookNumber == ' 2701':
        BookNumber = line[0]
        BookTitle1 = line[1]
        BookTitle2 = line[2]
        AuthorName = line[3]
        BookTitle = BookTitle1 + ';' + BookTitle2  # needed to combine title into one to fit dict format
    books = json.dumps({'AuthorName': AuthorName, 'BookNumber': BookNumber, 'BookTitle': BookTitle})
    books_json = json.loads(books)
    final_book_json.update(books_json)
    with open(path + 'books\\books_json.json', 'a') as out_put:
        json.dump(books_json, out_put)

with open(path + 'books\\books_json.json', 'r') as out_put:
    print(json.load(out_put))
The reported error is JSONDecodeError: Extra data: line 1 column 133 (char 132); that position is right between the first "}{". I am not sure how JSON should look in a flat-file format. The output file as seen in an editor looks like:
{"AuthorName": " Mary Wollstonecraft (Godwin) Shelley", "BookNumber": " 84", "BookTitle": " Frankenstein, or the Modern Prometheus"}{"AuthorName": " Charles Dickens", "BookNumber": " 98", "BookTitle": " A Tale of Two Cities"}...
I ended up changing the approach and used pandas to read the text and then splitting the single-cell input.
import pandas as pd

books = pd.read_csv(path + 'books\\set_of_books.txt', sep='\t', names=('r', 't', 'a'))
#print(books.head(10))

# Function to clean the 'raw' (r) input data
def clean_line(cell):
    ...
    return cell

books['r'] = books['r'].apply(clean_line)
books = books['r'].str.split(';', expand=True)
I am using R to extract tweets and analyse their sentiment; however, when I get to the lines below I get an error saying "object of type 'closure' is not subsettable".
scores$drink = factor(rep(c("east"), nd))
scores$very.pos = as.numeric(scores$score >= 2)
scores$very.neg = as.numeric(scores$score <= -2)
The full code is pasted below:
load("twitCred.Rdata")
east_tweets <- filterStream("tweetselnd.json", locations = c(-0.10444, 51.408699, 0.33403, 51.64661),timeout = 120, oauth = twitCred)
tweets.df <- parseTweets("tweetselnd.json", verbose = FALSE)
##function score.sentiment
# requires plyr (laply) and stringr (str_split)
score.sentiment = function(sentences, pos.words, neg.words, .progress='none')
{
  # Parameters
  # sentences: vector of text to score
  # pos.words: vector of words of positive sentiment
  # neg.words: vector of words of negative sentiment
  # .progress: passed to laply() to control the progress bar
  scores = laply(sentences,
    function(sentence, pos.words, neg.words)
    {
      # remove punctuation
      sentence = gsub("[[:punct:]]", "", sentence)
      # remove control characters
      sentence = gsub("[[:cntrl:]]", "", sentence)
      # remove digits
      sentence = gsub('\\d+', '', sentence)
      # define error handling function for tolower
      tryTolower = function(x)
      {
        # create missing value
        y = NA
        # tryCatch error
        try_error = tryCatch(tolower(x), error = function(e) e)
        # if not an error
        if (!inherits(try_error, "error"))
          y = tolower(x)
        # result
        return(y)
      }
      # use tryTolower with sapply
      sentence = sapply(sentence, tryTolower)
      # split sentence into words with str_split (stringr package)
      word.list = str_split(sentence, "\\s+")
      words = unlist(word.list)
      # compare words to the dictionaries of positive & negative terms
      pos.matches = match(words, pos.words)
      neg.matches = match(words, neg.words)
      # match() gives the position of the matched term or NA;
      # we just want TRUE/FALSE
      pos.matches = !is.na(pos.matches)
      neg.matches = !is.na(neg.matches)
      # final score
      score = sum(pos.matches) - sum(neg.matches)
      return(score)
    }, pos.words, neg.words, .progress = .progress)
  # data frame with scores for each sentence
  scores.df = data.frame(text = sentences, score = scores)
  return(scores.df)
}
pos = readLines(file.choose())
neg = readLines(file.choose())
east_text = sapply(east_tweets, function(x) x$getText())
scores = score.sentiment(tweetseldn.json, pos, neg, .progress='text')
scores()$drink = factor(rep(c("east"), nd))
scores()$very.pos = as.numeric(scores()$score >= 2)
scores$very.neg = as.numeric(scores$score <= -2)
# how many very positives and very negatives
numpos = sum(scores$very.pos)
numneg = sum(scores$very.neg)
# global score
global_score = round( 100 * numpos / (numpos + numneg) )
If anyone could help with why I'm getting this error, it would be much appreciated. I've also seen other answers about adding '()' when referring to the variable 'scores', such as scores()$..., but it hasn't worked for me. Thank you.
The changes below got rid of the error; the message means R was finding a function (a "closure") named scores, so binding the data to a fresh name x avoids the clash:
x <- scores
x$drink = factor(rep(c("east"), nd))
x$very.pos = as.numeric(x$score >= 2)
x$very.neg = as.numeric(x$score <= -2)
I've been asked to help parse some log files for a Symantec application (Altiris), and they were delivered to me in a pseudo-HTML/XML format. I've managed to use readLines() and grepl() to get the logs into a decent character vector format and clean out the junk, but I can't get them into a data frame.
As of right now, an entry looks something like this (since I can't post real data), all in a character vector with structure chr[1:312]:
[310] "<severity='4', hostname='computername125', source='PackageDownload', module='herpderp.dll', process='masterP.exe', pid='234' >"
I've had no luck with XML parsing and it does look more like HTML to me, and when I tried htmlTreeParse(x) I just ended up with a massive pyramid of tags.
If you're working with pseudo-XML, it's probably best to define the parsing rules yourself. I like stringr and dplyr for stuff like this.
Here's a two-element vector (instead of 312 in your case):
vec <- c(
"<severity='4', hostname='computername125', source='PackageDownload', module='herpderp.dll', process='masterP.exe', pid='234' >",
"<severity='5', hostname='computername126', source='PackageDownload', module='herpderp.dll', process='masterP.exe', pid='235' >"
)
Convert it to a data.frame object:
df <- data.frame(vec, stringsAsFactors = FALSE)
And select out your data based on their character index positions, relative to the positions of your variables of interest:
require(stringr)
require(dplyr)

df %>%
  mutate(
    severityStr = str_locate(vec, "severity")[, "start"],
    hostnameStr = str_locate(vec, "hostname")[, "start"],
    sourceStr = str_locate(vec, "source")[, "start"],
    moduleStr = str_locate(vec, "module")[, "start"],
    processStr = str_locate(vec, "process")[, "start"],
    pidStr = str_locate(vec, "pid")[, "start"],
    endStr = str_locate(vec, ">")[, "start"],
    severity = substr(vec, severityStr + 10, hostnameStr - 4),
    hostname = substr(vec, hostnameStr + 10, sourceStr - 4),
    source = substr(vec, sourceStr + 8, moduleStr - 4),
    module = substr(vec, moduleStr + 8, processStr - 4),
    process = substr(vec, processStr + 9, pidStr - 4),
    pid = substr(vec, pidStr + 5, endStr - 3)) %>%
  select(severity, hostname, source, module, process, pid)
Here's the resulting data frame:
severity hostname source module process pid
1 4 computername125 PackageDownload herpderp.dll masterP.exe 234
2 5 computername126 PackageDownload herpderp.dll masterP.exe 235
This solution is robust enough to handle string inputs of different lengths. For example, it would read pid correctly even if it were 95 (two digits instead of three).
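As an aside, since the fields always appear in the same order, a single regular expression could extract all six values at once; a minimal sketch with stringr::str_match (an alternative I am suggesting, not the method above):
library(stringr)

# one capture group per field; str_match returns a matrix whose first
# column is the full match and whose remaining columns are the groups
pattern <- paste0(
  "severity='([^']*)', hostname='([^']*)', source='([^']*)', ",
  "module='([^']*)', process='([^']*)', pid='([^']*)'"
)
m <- str_match(vec, pattern)
df2 <- setNames(
  as.data.frame(m[, -1], stringsAsFactors = FALSE),
  c("severity", "hostname", "source", "module", "process", "pid")
)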