I am new to FreeMarker and am trying to extract a list of values defined within brackets. Example below:
<alternative-id provider="group" level="episode" description="CHICAGO MED YR 5 (Sign) Ep 514">20595987</alternative-id>
<alternative-id provider="group" level="episode" description="CHICAGO MED YR 5 (U) Ep 514">20670620</alternative-id>
I would like to list the values within the parentheses ().
The output should be as below:
Sign
U
OR
Sign,U
I couldn't get the list directive working to extract the data; it only extracts the data between > and <:
20595987
20670620
Tested this successfully on https://try.freemarker.apache.org/:
Template
<#assign bracketValues = data?matches("<alternative-id([\\w\\s\=\"]*)\\(([\\w\\s\\d]*)\\)")>
<#list bracketValues as b>
${b?groups[2]}<#sep>,</#list>
Data model
data = "<alternative-id provider=\"group\" level=\"episode\" description=\"CHICAGO MED YR 5 (Sign) Ep 514\">20595987</alternative-id><alternative-id provider=\"group\" level=\"episode\" description=\"CHICAGO MED YR 5 (U) Ep 514\">20670620</alternative-id>"
Result
Sign,U
I have a file containing 61 zip files, each of which contains 17,280 zip files.
The file consists of sensor data for the months April and May, and the 61 zip files hold the sensor data per day for those months. The 17,280 zip files contain the data for those days in 5-second intervals.
I wrote code to open the data per day in a Jupyter Notebook, but I want to open it per month. Here is my code for opening the data per day:
import os
import zipfile
import pandas as pd

path = 'test_tudelft'
data = pd.DataFrame()  # collection of all data
for zip_filename in os.listdir(path):  # loop over the zip files
    with zipfile.ZipFile(os.path.join(path, zip_filename)) as zf:  # open a zip file
        for file in zf.filelist:  # loop over the packed files
            # data = pd.concat((data, pd.read_json(zf.open(file), lines=True)))  # read the new data and merge
            data_new = pd.read_json(zf.read(file).decode('utf8')[2:-1], orient='index').T
            data = pd.concat((data, data_new))  # read the new data and merge
data = data.reset_index(drop=True)  # unique index per file
This code works for making plots of the data per day, but I would like to plot the data per month. How can I change my code to open the zip files inside the zip files?
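For what it's worth, here is a minimal sketch of one way the nesting could be handled. It assumes the per-day zip files sit in the test_tudelft directory as in the code above and that each entry inside them is itself a zip file holding the JSON payloads; both are assumptions, since the exact layout isn't shown.

import io
import os
import zipfile
import pandas as pd

path = 'test_tudelft'
data = pd.DataFrame()
for zip_filename in os.listdir(path):  # loop over the per-day zip files
    with zipfile.ZipFile(os.path.join(path, zip_filename)) as day_zf:
        for inner_name in day_zf.namelist():  # each entry is assumed to be a nested zip
            inner_bytes = io.BytesIO(day_zf.read(inner_name))  # read the nested zip into memory
            with zipfile.ZipFile(inner_bytes) as inner_zf:
                for file in inner_zf.filelist:  # the 5-second files inside the nested zip
                    data_new = pd.read_json(inner_zf.read(file).decode('utf8')[2:-1], orient='index').T
                    data = pd.concat((data, data_new))
data = data.reset_index(drop=True)

Note that concatenating inside the loop copies the data repeatedly; collecting the frames in a list and calling pd.concat once at the end would likely be faster for a month's worth of files.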
I'm not sure where my error is, but I believe it's with the for loop. As it is, it prints values from the last record only. I tried to put the print below statusInput = profile_data['_statusinput'], inside the for loop, but that didn't work; I got:
IndentationError: unindent does not match any outer indentation level
Code
import json

with open('config/' + 'config.json', 'r') as file:
    data: list = json.load(file)

lista = data

for element in lista:
    print("")
    for alias_element in element:
        # print("Alias: " + alias_element)
        for result in element[alias_element]:
            profile_data = result
            aliasInput = profile_data['_aliasinput']
            timesInput = profile_data['_timesinput']
            idInput = profile_data['_idinput']
            statusInput = profile_data['_statusinput']
print(f" Values from register are {aliasInput}{timesInput}{idInput}{statusInput}")
Result
Last record value only.
Example:
Values from register are test2 12:45 19:20 888888 true
Expected
Print the values of all records on the screen, and understand why the current code doesn't. I'd also like to add a condition so it prints only if statusInput == true.
Example:
Values from register are test 10:20 11111 true
Values from register are test1 11:50 99999 true
Values from register are test2 12:45 19:20 888888 true
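Not the asker's exact file, but a minimal sketch of how the indentation and the extra condition might look, assuming the print belongs inside the innermost loop and that _statusinput holds a real boolean (if it is the string "true" instead, the check would need to be statusInput == "true"):

import json

with open('config/config.json', 'r') as file:
    data: list = json.load(file)

for element in data:
    print("")
    for alias_element in element:
        for profile_data in element[alias_element]:
            aliasInput = profile_data['_aliasinput']
            timesInput = profile_data['_timesinput']
            idInput = profile_data['_idinput']
            statusInput = profile_data['_statusinput']
            # keep the print inside the innermost loop so every record is shown,
            # and only print when the status flag is set
            if statusInput:
                print(f" Values from register are {aliasInput} {timesInput} {idInput} {statusInput}")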
I have a list of more than 100,000 JSON files from which I want to get a data.table with only a few variables. Unfortunately the files are complex. The content of each JSON file looks like:
Sample 1
$id
[1] "10.1"
$title
$title$value
[1] "Why this item"
$itemsource
$itemsource$id
[1] "AA"
$date
[1] "1992-01-01"
$itemType
[1] "art"
$creators
list()
Sample 2
$id
[1] "10.2"
$title
$title$value
[1] "We need this item"
$itemsource
$itemsource$id
[1] "AY"
$date
[1] "1999-01-01"
$itemType
[1] "art"
$creators
type name firstname surname affiliationIds
1 Person Frank W. Cornell. Frank W. Cornell. a1
2 Person David A. Chen. David A. Chen. a1
$affiliations
id name
1 a1 Foreign Affairs Desk, New York Times
What I need from this set of files is a table with creator names, item ids and dates. For the two sample files above:
id date name firstname lastname creatortype
"10.1" "1992-01-01" NA NA NA NA
"10.2" "1999-01-01" Frank W. Cornell. Frank W. Cornell. Person
"10.2" "1999-01-01" David A. Chen. David A. Chen. Person
What I have done so far:
library(parallel)
library(data.table)
library(jsonlite)
library(dplyr)
filelist = list.files(pattern="*.json",recursive=TRUE,include.dirs =TRUE)
parsed = mclapply(filelist, function(x) fromJSON(x),mc.cores=24)
data = rbindlist(mclapply(1:length(parsed), function(x) {
a = data.table(item = parsed[[x]]$id, date = list(list(parsed[[x]]$date)), name = list(list(parsed[[x]]$name)), creatortype = list(list(parsed[[x]]$creatortype))) #ignoring the firstname/lastname fields here for convenience
b = data.table(id = a$item, date = unlist(a$date), name=unlist(a$name), creatortype=unlist(a$creatortype))
return(b)
},mc.cores=24))
However, on the last step, I get this error:
"Error in rbindlist(mclapply(1:length(parsed), function(x){:
Item 1 of list is not a data.frame, data.table or list"
Thanks in advance for your suggestions.
Related questions include:
Extract data from list of lists [R]
R convert json to list to data.table
I want to convert JSON file into data.table in r
How can read files from directory using R?
Convert R data table column from JSON to data table
From the error message, I suppose this basically means that one of the results from mclapply() is empty (by empty I mean either NULL or a data.table with 0 rows), or that it simply encounters an error within the parallel processing.
What you could do is:
add more checks inside the mclapply(), e.g. wrap the body in try() and test for the "try-error" class, or check the class and nrow() of b to see whether it is empty or not
when you use rbindlist, add the argument fill = TRUE
Hope this solves your problem.
I would appreciate a nudge in the right direction with this problem.
Below is a spider that:
1. crawls the listing page and retrieves each record's summary info (10 rows/page)
2. follows the URL to extract detailed info on each individual record's page
3. goes to the next listing page
Problem: each record's detailed info is extracted fine, but each record contains the summary info of the last record from the same listing page.
Simplified example:
URL DA Detail1 Detail2
9 9 0 0
9 9 1 1
9 9 2 2
9 9 3 3
9 9 4 4
9 9 5 5
9 9 6 6
9 9 7 7
9 9 8 8
9 9 9 9
With the scrapy shell, I can iterate through manually and get the correct values as shown below:
import scrapy
from cbury_scrapy.items import DA
for row in response.xpath('//table/tr[@class="datrack_resultrow_odd" or @class="datrack_resultrow_even"]'):
    r = scrapy.Selector(text=row.extract(), type="html")
    print r.xpath('//td[@class="datrack_danumber_cell"]//text()').extract_first(), r.xpath('//td[@class="datrack_danumber_cell"]//@href').extract_first()[-5:]
Output
SC-18/2016 HQQM=
DA-190/2016 HQwQ=
DA-192/2016 HQAk=
S68-122/2016 HQgM=
DA-191/2016 HQgc=
DA-223/2015/A HQQY=
DA-81/2016/A GSgY=
PCA-111/2016 GSwU=
PCD-101/2016 GSwM=
PCD-100/2016 GRAc=
When the spider is run, the last record's summary details repeat for each record on the same listing page. Please see the spider below; the offending code seems to be the first 10 lines of the parse method.
""" Run under bash with:
timenow=`date +%Y%m%d_%H%M%S`; scrapy runspider cbury_spider.py -o cbury-scrape-$timenow.csv
Problems? Interactively check Xpaths etc.:
scrapy shell "http://datrack.canterbury.nsw.gov.au/cgi/datrack.pl?search=search&sortfield=^metadata.date_lodged""""
import scrapy
from cbury_scrapy.items import DA
def td_text_after(label, response):
""" retrieves text from first td following a td containing a label e.g.:"""
return response.xpath("//*[contains(text(), '" + label + "')]/following-sibling::td//text()").extract_first()
class CburySpider(scrapy.Spider):
    # scrapy.Spider attributes
    name = "cbury"
    allowed_domains = ["datrack.canterbury.nsw.gov.au"]
    start_urls = ["http://datrack.canterbury.nsw.gov.au/cgi/datrack.pl?search=search&sortfield=^metadata.date_lodged",]

    # required for unicode character replacement of '$' and ',' in est_cost
    translation_table = dict.fromkeys(map(ord, '$,'), None)

    da = DA()
    da['lga'] = u"Canterbury"

    def parse(self, response):
        """ Retrieve DA no., URL and address for DA on summary list page """
        for row in response.xpath('//table/tr[@class="datrack_resultrow_odd" or @class="datrack_resultrow_even"]'):
            r = scrapy.Selector(text=row.extract(), type="html")
            self.da['da_no'] = r.xpath('//td[@class="datrack_danumber_cell"]//text()').extract_first()
            self.da['house_no'] = r.xpath('//td[@class="datrack_houseno_cell"]//text()').extract_first()
            self.da['street'] = r.xpath('//td[@class="datrack_street_cell"]//text()').extract_first()
            self.da['town'] = r.xpath('//td[@class="datrack_town_cell"]//text()').extract_first()
            self.da['url'] = r.xpath('//td[@class="datrack_danumber_cell"]//@href').extract_first()

            # then retrieve remaining DA details from the detail page
            yield scrapy.Request(self.da['url'], callback=self.parse_da_page)

        # follow next page link if one exists
        next_page = response.xpath("//*[contains(text(), 'Next')]/@href").extract_first()
        if next_page:
            yield scrapy.Request(next_page, self.parse)

    def parse_da_page(self, response):
        """ Retrieve DA information from its detail page """
        labels = {'date_lodged': 'Date Lodged:', 'desc_full': 'Description:',
                  'est_cost': 'Estimated Cost:', 'status': 'Status:',
                  'date_determined': 'Date Determined:', 'decision': 'Decision:',
                  'officer': 'Responsible Officer:'}

        # map DA fields with those in the following <td> elements on the page
        for i in labels:
            self.da[i] = td_text_after(labels[i], response)

        # convert est_cost text to int for easier sheet import "12,000" -> 12000
        if self.da['est_cost'] != None:
            self.da['est_cost'] = int(self.da['est_cost'].translate(self.translation_table))

        # Get people data from 'Names' table with 'Role' heading
        self.da['names'] = []
        for row in response.xpath('//table/tr[th[1]="Role"]/following-sibling::tr'):
            da_name = {}
            da_name['role'] = row.xpath('normalize-space(./td[1])').extract_first()
            da_name['name_no'] = row.xpath('normalize-space(./td[2])').extract_first()
            da_name['full_name'] = row.xpath('normalize-space(./td[3])').extract_first()
            self.da['names'].append(da_name)

        yield self.da
Your help would be much appreciated.
Scrapy is asynchronous; once you've submitted a request, there's no guarantee when that request will be actioned. Because of this, your self.da is unreliable for passing data to parse_da_page. Instead, create da_items = DA() in your parse routine and pass it in the request as meta.
for row in response.xpath(...):
    da_items = DA()
    da_items['street'] = row.xpath(...)
    ...
    da_items['url'] = row.xpath(...)
    yield scrapy.Request(da_items['url'], callback=self.parse_da_page, meta=da_items)
Then in parse_da_page you can retrieve these values using response.meta['street'], etc. Have a look at the docs.
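To make that concrete, here is a rough sketch of what the receiving side might look like; only street and url from the question's fields are shown, and the remaining summary fields would be copied the same way.

def parse_da_page(self, response):
    # rebuild the item from the summary values carried in the request's meta
    da = DA()
    da['street'] = response.meta['street']
    da['url'] = response.meta['url']
    # ... copy the other summary fields the same way ...
    # then fill in the detail-page fields as before and yield the completed item
    da['status'] = td_text_after('Status:', response)
    yield da

Alternatively, the whole item can be passed under a single key, e.g. meta={'da': da_items}, and retrieved with response.meta['da'] in the callback.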
Note also that your line r = scrapy.Selector(text=row.extract(), type="html") is redundant; you can simply use the variable row directly, as I've done in my example above.
I'm making word frequency tables with R, and the preferred output format would be a JSON file, something like:
{
  "word" : "dog",
  "frequency" : 12
}
Is there any way to save the table directly in this format? I've been using the write.csv() function and converting the output into JSON, but this is very complicated and time consuming.
set.seed(1)
( tbl <- table(round(runif(100, 1, 5))) )
## 1 2 3 4 5
## 9 24 30 23 14
library(rjson)
sink("json.txt")
cat(toJSON(tbl))
sink()
file.show("json.txt")
## {"1":9,"2":24,"3":30,"4":23,"5":14}
or even better:
set.seed(1)
( tab <- table(letters[round(runif(100, 1, 26))]) )
a b c d e f g h i j k l m n o p q r s t u v w x y z
1 2 4 3 2 5 4 3 5 3 9 4 7 2 2 2 5 5 5 6 5 3 7 3 2 1
sink("lets.txt")
cat(toJSON(tab))
sink()
file.show("lets.txt")
## {"a":1,"b":2,"c":4,"d":3,"e":2,"f":5,"g":4,"h":3,"i":5,"j":3,"k":9,"l":4,"m":7,"n":2,"o":2,"p":2,"q":5,"r":5,"s":5,"t":6,"u":5,"v":3,"w":7,"x":3,"y":2,"z":1}
Then validate it with http://www.jsonlint.com/ to get pretty formatting. If you have a multidimensional table, you'll have to work it out a bit...
EDIT:
Oh, now I see: you want the dataset characteristics sink-ed to a JSON file. No problem, just give us some sample data and I'll work on the code a bit. Practically, you need to get the data into the desired format and then convert it to JSON. A list should suffice. Give me a sec, I'll update my answer.
EDIT #2:
Well, time is relative... it's common knowledge... Here you go:
( dtf <- structure(list(word = structure(1:3, .Label = c("cat", "dog",
"mouse"), class = "factor"), frequency = c(12, 32, 18)), .Names = c("word",
"frequency"), row.names = c(NA, -3L), class = "data.frame") )
## word frequency
## 1 cat 12
## 2 dog 32
## 3 mouse 18
If dtf is a simple data frame (yes, a data.frame), fine; if it's not, coerce it! Long story short, you can do:
toJSON(as.data.frame(t(dtf)))
## [1] "{\"V1\":{\"word\":\"cat\",\"frequency\":\"12\"},\"V2\":{\"word\":\"dog\",\"frequency\":\"32\"},\"V3\":{\"word\":\"mouse\",\"frequency\":\"18\"}}"
I thought I'd need some melt for this one, but a simple t did the trick. Now you only need to deal with the column names after transposing the data.frame. t coerces a data.frame to a matrix, so you need to convert it back to a data.frame. I used as.data.frame, but you can also use toJSON(data.frame(t(dtf))) - you'll get X instead of V as the variable name. Alternatively, you can use a regexp to clean the JSON file (if needed), but that's lousy practice; try to work it out by preparing the data.frame.
I hope this helped a bit...
These days I would typically use the jsonlite package.
library("jsonlite")
toJSON(mydatatable, pretty = TRUE)
This turns the data table into a JSON array of key/value pair objects directly.
RJSONIO is a package "that allows conversion to and from data in Javascript object notation (JSON) format". You can use it to export your object as a JSON file.
library(RJSONIO)
writeLines(toJSON(anobject), "afile.JSON")