Strange behaviour: `read` works when issuing commands directly, but not inside a function

I am just getting restarted with Julia (I made some attempts a couple of years ago, but the libraries were still missing too much).
I am now trying something really simple and can't figure out why it doesn't work.
If I run these very same commands directly outside a function, I get what I want, but if I put them inside a function, I get an error when calling `read` inside my read_datafile function:
using ArgParse, ZipFile, CSV, DataFrames
function read_datafile(fp)
    z = ZipFile.Reader(fp)
    a = z.files[1]
    df = DataFrame(CSV.File(read(a)))
    return df
end
read_datafile("./folder1/test.zip")
SystemError: seek: Bad file descriptor
Stacktrace:
 [1] #systemerror#48 at ./error.jl:167 [inlined]
 [2] systemerror at ./error.jl:167 [inlined]
 [3] seek at ./iostream.jl:129 [inlined]
 [4] read(::ZipFile.ReadableFile, ::Int64) at /home/morgado/.julia/packages/ZipFile/fdYkP/src/ZipFile.jl:508
 [5] read at /home/morgado/.julia/packages/ZipFile/fdYkP/src/ZipFile.jl:504 [inlined]
 [6] read_datafile(::String) at ./In[14]:4
 [7] top-level scope at In[15]:1
EDIT:
Added more info.
using Pkg; Pkg.status()
Status `~/.julia/environments/v1.5/Project.toml`
[c7e460c6] ArgParse v1.1.1
[336ed68f] CSV v0.8.3
[a93c6f00] DataFrames v0.21.8
[92fee26a] GZip v0.5.1
[7073ff75] IJulia v1.23.1
[6f49c342] RCall v0.13.10
[fd094767] Suppressor v0.2.0
[70df011a] TableReader v0.4.0
[a5390f91] ZipFile v0.9.3

I found the answer: it's a five-year-old unsolved bug in the ZipFile package :( https://github.com/fhs/ZipFile.jl/issues/14
The workaround is to hold the reader in a global variable, apparently so it cannot be garbage-collected (which closes the underlying file) while its entries are still being read:
function read_datafile(fp)
    global z = ZipFile.Reader(fp)
    a = z.files[1]
    df = DataFrame(CSV.File(read(a)))
    return df
end
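If the root cause is indeed premature finalization, a minimal sketch of an alternative that avoids the global is to root the reader with GC.@preserve for the duration of the read (untested against this particular bug):

function read_datafile(fp)
    z = ZipFile.Reader(fp)
    a = z.files[1]
    # Keep z reachable while its entry is read, so its finalizer
    # cannot close the underlying file descriptor mid-read.
    df = GC.@preserve z DataFrame(CSV.File(read(a)))
    return df
end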


ERROR: LoadError: BoundsError: attempt to access 429×20 ArrayLogical{2} at index [430, Base.Slice(Base.OneTo(20))] running Julia script

I am trying to run a Julia script on 4 CSV files, but I keep getting this error:
samde#DESKTOP-V6PBDGC MINGW64 ~/mimix (master)
$ C:/Users/samde/AppData/Local/Programs/Julia/Julia-1.4.2/bin/julia.exe scripts/fit-mcmc.jl --hyper nutnet-analysis/configs/hyper.yml --monitor nutnet-analysis/configs/monitor-mimix.yml --inits nutnet-analysis/configs/inits.yml --factors 20 nutnet-analysis/test-data samnutnet-results
Reading X.csv
Reading Y.csv
Reading Z.csv
Beginning MCMC sampling
ERROR: LoadError: BoundsError: attempt to access 429×20 ArrayLogical{2} at index [430, Base.Slice(Base.OneTo(20))]
Stacktrace:
[1] throw_boundserror(::ArrayLogical{2}, ::Tuple{Int64,Base.Slice{Base.OneTo{Int64}}}) at .\abstractarray.jl:537
[2] checkbounds at .\abstractarray.jl:502 [inlined]
[3] _getindex at .\multidimensional.jl:726 [inlined]
[4] getindex at .\abstractarray.jl:980 [inlined]
[5] (::MicrobiomeMixedModels.var"#14#63")(::ArrayLogical{2}, ::Int64) at C:\Users\samde\mimix\MicrobiomeMixedModels.jl\src\models\mimix.jl:109
[6] (::var"#36#37")(::Model) at .\array.jl:0
[7] setinits!(::ArrayStochastic{2}, ::Model, ::Array{Float64,2}) at C:\Users\samde\.julia\packages\Mamba\PkMTm\src\model\dependent.jl:173
[8] setinits!(::Model, ::Dict{Symbol,Any}) at C:\Users\samde\.julia\packages\Mamba\PkMTm\src\model\initialization.jl:11
[9] setinits!(::Model, ::Array{Dict{Symbol,Any},1}) at C:\Users\samde\.julia\packages\Mamba\PkMTm\src\model\initialization.jl:24
[10] mcmc(::Model, ::Dict{Symbol,Any}, ::Array{Dict{Symbol,Any},1}, ::Int64; burnin::Int64, thin::Int64, chains::Int64, verbose::Bool) at C:\Users\samde\.julia\packages\Mamba\PkMTm\src\model\mcmc.jl:30
[11] top-level scope at C:\Users\samde\mimix\scripts\fit-mcmc.jl:150
[12] include(::Module, ::String) at .\Base.jl:377
[13] exec_options(::Base.JLOptions) at .\client.jl:288
[14] _start() at .\client.jl:484
in expression starting at C:\Users\samde\mimix\scripts\fit-mcmc.jl:105
I have tried deleting the last lines in every CSV file. I tried saving them as normal CSV files and as CSV UTF-8 files. Running wc -l shows them all to be the same length. I have looked at similar questions but have had trouble understanding the solutions. Any idea what could fix this error?
Here are the files: https://github.com/samd1993/mimixtest.git
The code works on the sample files found here: https://github.com/nsgrantham/mimix/tree/master/nutnet-analysis/reduced-data
@mbauman here is the code from mimix.jl at line 109:
108 F = Stochastic(2,
109     (F_mean, N) -> MultivariateDistribution[
110         MvNormal(F_mean[i, :], 1.0) for i in 1:N
111     ],
        false
    )
Thank you,
Sam
I found the solution to my own problem with the help of @mbauman's hint. The error was occurring because my CSV files did not all have the same dimensions. So if anyone runs into this issue, make sure your data files are intact and there aren't any extra columns or rows. Rudimentary... but important.
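For anyone wanting to check this quickly, a minimal sketch that prints the dimensions of each input (file names as in the log above; CSV and DataFrames assumed available):

using CSV, DataFrames

# Print rows x columns for each input so any mismatch stands out.
for f in ["X.csv", "Y.csv", "Z.csv"]
    df = DataFrame(CSV.File(f))
    println(f, ": ", nrow(df), " rows x ", ncol(df), " cols")
end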
Cheers,
Sam

Julia ODBC.query MethodError: no method matching eachcolumn

I've been using Julia for some graphical results, using ODBC to connect to an MS Access database to get the data.
The same function worked flawlessly two weeks ago, but now it throws an error:
ERROR: MethodError: no method matching eachcolumn(::Tables.CopiedColumns{NamedTuple{(:year, :Fact),Tuple{Array{Union{Missing, Int16},1},Array{Union{Missing, Float64},1}}}})
Closest candidates are:
eachcolumn(::Union{Function, Type}, ::Tables.Schema{names,nothing}, ::Any) where names at C:\Users\myuser\.julia\packages\Tables\TA7NF\src\utils.jl:109
eachcolumn(::Union{Function, Type}, ::Tables.Schema{names,types}, ::Any) where {names, types} at C:\Users\myuser\.julia\packages\Tables\TA7NF\src\utils.jl:66
Stacktrace:
[1] #fromcolumns#410(::Bool, ::typeof(DataFrames.fromcolumns), ::Tables.CopiedColumns{NamedTuple{(:anno, :Fact),Tuple{Array{Union{Missing, Int16},1},Array{Union{Missing, Float64},1}}}}) at C:\Users\myuser\.julia\packages\DataFrames\yH0f6\src\other\tables.jl:13
[2] (::DataFrames.var"#kw##fromcolumns")(::NamedTuple{(:copycols,),Tuple{Bool}}, ::typeof(DataFrames.fromcolumns), ::Tables.CopiedColumns{NamedTuple{(:anno, :Fact),Tuple{Array{Union{Missing, Int16},1},Array{Union{Missing, Float64},1}}}}) at .\none:0
[3] #DataFrame#412(::Bool, ::Type{DataFrame}, ::ODBC.Query{missing,NamedTuple{(:anno, :Fact),Tuple{Union{Missing, Int16},Union{Missing, Float64}}},Tuple{Array{Union{Missing, Int16},1},Array{Union{Missing, Float64},1}}}) at C:\Users\myuser\.julia\packages\DataFrames\yH0f6\src\other\tables.jl:32
[4] DataFrame(::ODBC.Query{missing,NamedTuple{(:anno, :Fact),Tuple{Union{Missing, Int16},Union{Missing, Float64}}},Tuple{Array{Union{Missing, Int16},1},Array{Union{Missing, Float64},1}}}) at C:\Users\myuser\.julia\packages\DataFrames\yH0f6\src\other\tables.jl:23
[5] #query#15(::Bool, ::Bool, ::Dict{Int64,Function}, ::typeof(ODBC.query), ::ODBC.DSN, ::String, ::Type{DataFrame}) at C:\Users\myuser\.julia\packages\ODBC\YEzHX\src\Query.jl:390
[6] query(::ODBC.DSN, ::String, ::Type{DataFrame}) at C:\Users\myuser\.julia\packages\ODBC\YEzHX\src\Query.jl:385
[7] query(::ODBC.DSN, ::String) at C:\Users\myuser\.julia\packages\ODBC\YEzHX\src\Query.jl:376
[8] top-level scope at C:\Users\myuser\Documents\Fact.jl:94
It seems like there is some kind of incompatibility between ODBC's Query.jl and Tables.jl.
Here is the code used:
using DataFrames
using DataStreams
using ODBC
using StatsBase
using Plots
myDNS = ODBC.DSN("Driver={Microsoft Access Driver (*.mdb, *.accdb)}; DBQ=C:/Users/myuser/Documents/Data.accdb")
strFactQuery = "SELECT YEAR(FFact) AS anno, SUM(Invoiced) AS Fact FROM Invoices GROUP BY YEAR(FFact)"
FactResults = ODBC.query(myDNS, strFactQuery)
Has anyone else had the same problem? Maybe it's a Query.jl bug?
Regards
Downgrading to Tables v0.2.11 solved the problem, but getting that version was a pain, as downgrading via the package manager was not possible because of some dependency problems. So, all in all, I went to GitHub, selected the version, copied the code, and pasted it over the library code in my environment... maybe not the most elegant way, but it worked.
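For reference, a minimal sketch of the cleaner route through the package manager (it will still refuse if other packages in the environment require a newer Tables, which may be exactly the dependency problem described above):

using Pkg
# Explicitly request and pin Tables v0.2.11; Pkg will error if the
# rest of the environment is incompatible with that version.
Pkg.add(PackageSpec(name="Tables", version="0.2.11"))
Pkg.pin(PackageSpec(name="Tables", version="0.2.11"))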

Error while trying to parse JSON in R

I have recently started using R and have a task that requires parsing JSON in R into a non-JSON format. For this, I am using the fromJSON() function. I have tried to parse the JSON as a text file. It runs successfully when I do it with just a single row entry, but when I try it with multiple row entries, I get the following errors:
fromJSON("D:/Eclairs/Printing/test3.txt")
Error in feed_push_parser(readBin(con, raw(), n), reset = TRUE) :
lexical error: invalid char in json text.
[{'CategoryType':'dining','City':
(right here) ------^
> fromJSON("D:/Eclairs/Printing/test3.txt")
Error in feed_push_parser(readBin(con, raw(), n), reset = TRUE) :
parse error: trailing garbage
"mumbai","Location":"all"}] [{"JourneyType":"Return","Origi
(right here) ------^
> fromJSON("D:/Eclairs/Printing/test3.txt")
Error in feed_push_parser(readBin(con, raw(), n), reset = TRUE) :
parse error: after array element, I expect ',' or ']'
:"mumbai","Location":"all"} {"JourneyType":"Return","Origin
(right here) ------^
The above errors come from three different formats in which I tried to parse the JSON text; the result was the same, only the location indicated by the caret changed.
Please help me identify the cause of this error, or suggest a more efficient way of performing the task.
The original file I have is an Excel sheet with multiple columns, one of which consists of JSON text. What I have tried so far is extracting just the JSON column, converting it to a tab-separated text file, and then parsing it as:
fromJSON("D:/Eclairs/Printing/test3.txt")
Please also suggest if this can be done more efficiently. I need to map all the columns in the Excel sheet to the non-JSON text as well.
Example:
[{"CategoryType":"dining","City":"mumbai","Location":"all"}]
[{"CategoryType":"reserve-a-table","City":"pune","Location":"Kothrud,West Pune"}]
[{"Destination":"Mumbai","CheckInDate":"14-Oct-2016","CheckOutDate":"15-Oct-2016","Rooms":"1","NoOfPax":"3","NoOfAdult":"3","NoOfChildren":"0"}]
Each line of your file is a complete JSON document, so parsing the whole file in one call fails with errors like "trailing garbage". Consider reading in the text line by line with readLines(), iteratively saving the JSON data frames to a growing list:
library(jsonlite)
con <- file("C:/Path/To/Jsons.txt", open="r")
jsonlist <- list()
while (length(line <- readLines(con, n=1, warn=FALSE)) > 0) {
  jsonlist <- append(jsonlist, list(fromJSON(line)))
}
close(con)
jsonlist
# [[1]]
# CategoryType City Location
# 1 dining mumbai all
# [[2]]
# CategoryType City Location
# 1 reserve-a-table pune Kothrud,West Pune
# [[3]]
# Destination CheckInDate CheckOutDate Rooms NoOfPax NoOfAdult NoOfChildren
# 1 Mumbai 14-Oct-2016 15-Oct-2016 1 3 3 0
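To also map the other Excel columns, a minimal sketch, assuming the workbook is read with the readxl package and the JSON sits in a column named json_column (file and column names are illustrative):

library(readxl)
library(jsonlite)

# Hypothetical file and column names; adjust to your sheet.
xl <- read_excel("D:/Eclairs/Printing/data.xlsx")
# Parse each JSON cell into a data frame, one per row of the sheet.
parsed <- lapply(xl$json_column, fromJSON)
# parsed[[i]] corresponds to row i, so the remaining columns of xl
# can be joined back on by row index.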

httr GET operation unable to access JSON response

I am trying to access the JSON response from an API call in my R script. The API call is successful, and I can view the JSON response in the console. However, I am unable to access any of the data in it.
A sample code segment is:
require(httr)
require(urltools)  # assuming url_encode() here comes from urltools; base utils::URLencode() would also work

target <- '#trump'
sentence <- 'Donald trump has a wonderful toupe, it really is quite stunning that a man can be so refined and elegant'
query <- url_encode(sentence)
target <- gsub('#', '', target)
endpoint <- "https://alchemy.p.mashape.com/text/TextGetTargetedSentiment?outputMode=json&target="
apiCall <- paste(endpoint, target, '&text=', query, sep = '')
resp <- GET(apiCall, add_headers("X-Mashape-Key" = sentimentKey, "Accept" = "application/json"))  # sentimentKey is defined elsewhere
stop_for_status(resp)
headers(resp)
str(content(resp))
content(resp, "text")
I followed examples in the httr quickstart guide from CRAN (here) as well as this related Stack Overflow question.
Unfortunately, I keep getting either "unused parameters 'text' in content()" or "no definition exists for content() accepting a class of 'response'". Does anyone have any advice? P.S. the headers will print, and resp$content will print the raw bitstream.
Expanding on the comment, you need to set the content type explicitly in the call to content(...). Since your code is not reproducible, here is an example using the Census Bureau's geocoder (which returns a JSON response).
library(httr)
url <- "http://geocoding.geo.census.gov/geocoder/locations/onelineaddress"
resp <- GET(url, query=list(address="1600 Pennsylvania Avenue, Washington DC",
                            benchmark=9,
                            format="json"))
json <- content(resp, type="application/json")
json$result$addressMatches[[1]]$coordinates
# $x
# [1] -77.038025
#
# $y
# [1] 38.898735
Assuming you are actually getting a JSON response, and that it is well-formed, simply using content(resp, type="application/json") should work.

Reading a huge JSON file in R, issues

I am trying to read a very huge JSON file using R, and I am using the rjson library with this command:
json_data <- fromJSON(paste(readLines("myfile.json"), collapse=""))
The problem is that I am getting this error message:
Error in paste(readLines("myfile.json"), collapse = "") :
could not allocate memory (2383 Mb) in C function 'R_AllocStringBuffer'
Can anyone help me with this issue?
Just sharing my experience with reading JSON files. I tried to read 52.8 MB, 19.7 MB, 1.3 GB, 93.9 MB, and 158.5 MB JSON files; it cost me 30 minutes and in the end my R session auto-restarted. After that I tried to apply parallel computing so I could at least see the progress, but it failed:
https://github.com/hadley/plyr/issues/265
Then I tried adding the parameter pagesize = 10000 to stream_in; it worked and was more efficient than ever. We only need to read the file once and can later save it in RData/Rda/Rds format with saveRDS (see the sketch after the session log below).
> suppressPackageStartupMessages(library('BBmisc'))
> suppressAll(library('jsonlite'))
> suppressAll(library('plyr'))
> suppressAll(library('dplyr'))
> suppressAll(library('stringr'))
> suppressAll(library('doParallel'))
>
> registerDoParallel(cores=16)
>
> ## https://www.kaggle.com/c/yelp-recsys-2013/forums/t/4465/reading-json-files-with-r-how-to
> ## https://class.coursera.org/dsscapstone-005/forum/thread?thread_id=12
> fnames <- c('business','checkin','review','tip','user')
> jfile <- paste0(getwd(),'/yelp_dataset_challenge_academic_dataset/yelp_academic_dataset_',fnames,'.json')
> dat <- llply(as.list(jfile), function(x) stream_in(file(x),pagesize = 10000),.parallel=TRUE)
> dat
list()
> jfile
[1] "/home/ryoeng/Coursera-Data-Science-Capstone/yelp_dataset_challenge_academic_dataset/yelp_academic_dataset_business.json"
[2] "/home/ryoeng/Coursera-Data-Science-Capstone/yelp_dataset_challenge_academic_dataset/yelp_academic_dataset_checkin.json"
[3] "/home/ryoeng/Coursera-Data-Science-Capstone/yelp_dataset_challenge_academic_dataset/yelp_academic_dataset_review.json"
[4] "/home/ryoeng/Coursera-Data-Science-Capstone/yelp_dataset_challenge_academic_dataset/yelp_academic_dataset_tip.json"
[5] "/home/ryoeng/Coursera-Data-Science-Capstone/yelp_dataset_challenge_academic_dataset/yelp_academic_dataset_user.json"
> dat <- llply(as.list(jfile), function(x) stream_in(file(x),pagesize = 10000),.progress='=')
opening file input connection.
Imported 61184 records. Simplifying into dataframe...
closing file input connection.
opening file input connection.
Imported 45166 records. Simplifying into dataframe...
closing file input connection.
opening file input connection.
Found 470000 records...
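A minimal sketch of that read-once-then-cache pattern (file names are illustrative):

library(jsonlite)

# Stream the JSON in 10000-line chunks, then cache the parsed result
# so later sessions can skip the slow parse entirely.
dat <- stream_in(file("data.json"), pagesize = 10000)
saveRDS(dat, "data.rds")

# Later: reload in seconds.
dat <- readRDS("data.rds")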
I got the same problem while working with huge datasets in R. I used the jsonlite package to read the JSON, with the following code:
library(jsonlite)
get_tweets <- stream_in(file("tweets.json"), pagesize = 10000)
Here tweets.json is my file name (including the location where it exists), and pagesize sets how many lines it reads in one iteration. Hope it helps.
For some reason the above solutions all caused R to terminate or worse.
This solution worked for me, with the same data set:
library(jsonlite)
file_name <- 'C:/Users/Downloads/yelp_dataset/yelp_dataset~/dataset/business.JSON'
# Note: readLines(n=100000) parses only the first 100000 lines of the file.
business <- jsonlite::stream_in(textConnection(readLines(file_name, n=100000)), verbose=F)
It took about 15 minutes.