stream_in part of .json file - json

I have a large .json file and I only want to read in a part of it.
I tried the following solutions, but they didn't work:
yelp <- stream_in(file("yelp_academic_dataset_review.json"), paigesize = 500)
yelp <- stream_in(file("yelp_academic_dataset_review.json"), nrows = 500)
Does anyone know how to do this?

First off, it's always helpful to provide the packages you are using; in your case, jsonlite.
One solution is parsing the data file (as a .txt file) prior to streaming it in.
yelp <- readLines("yelp_academic_dataset_review.json")[1:500]  # keep only the first 500 lines
yelp <- stream_in(textConnection(gsub("\\n", "", yelp)))       # stream those lines in as NDJSON
I'm assuming your file is local?
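If the file is too big to read in full with readLines(), a minimal sketch (assuming the file is newline-delimited JSON and jsonlite is attached) that pulls only the first 500 lines through an open connection might look like:
library(jsonlite)

con <- file("yelp_academic_dataset_review.json", open = "r")
first_lines <- readLines(con, n = 500)           # read just the first 500 lines
close(con)

yelp <- stream_in(textConnection(first_lines))   # parse them as NDJSON into a data frame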

I have had success with actual piping/streaming of JSON in the past, i.e., from the command line:
cat x.json | parse_json.py
Then you write your python script:
import json, sys

for line in sys.stdin:
    try:
        js_line = json.loads(line.rstrip())  # parse one JSON object per line
        # do something with js_line['x']['y']
    except ValueError:
        pass  # skip lines that are not valid JSON
I'm not sure why you want to use stream_in, but this somewhat manual approach can be effective.
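If you do want to stay with stream_in, the same piping idea can be expressed in R through a pipe() connection. A rough sketch, assuming a Unix-like shell where head is available:
library(jsonlite)

# head emits only the first 500 lines, so R never has to touch the rest of the file
yelp <- stream_in(pipe("head -n 500 yelp_academic_dataset_review.json"))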

I use this code to extract lines 1,400,001 to 1,450,000 of the Yelp data:
setwd("d:/yelp_dataset")
rm(list = ls())
library(jsonlite)

rev  <- "d:/yelp_dataset/review.JSON"
revu <- jsonlite::stream_in(textConnection(readLines(rev)[1400001:1450000]), verbose = FALSE)
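Note that readLines(rev) still loads the entire file into memory before subsetting. A sketch that instead skips ahead on an open connection (same line counts as above; assumes the file holds one JSON object per line):
con <- file("d:/yelp_dataset/review.JSON", open = "r")
invisible(readLines(con, n = 1400000))        # discard the first 1,400,000 lines
chunk <- readLines(con, n = 50000)            # then read the 50,000 lines of interest
close(con)

revu <- jsonlite::stream_in(textConnection(chunk), verbose = FALSE)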

Related

Using sys.stdout.write() to create multiple files in NiFi?

I have a pipeline in NiFi that pulls down some invalid JSON that I need to clean up. The best solution I've concocted is to run a Python script via ExecuteStreamCommand and simultaneously clean/split it up in one fell swoop. However, even though I use sys.stdout.write() in my for loop, only the original JSON comes out in the output stream in NiFi.
Am I misusing sys.stdout.write(), or is this possible and I've just done something wrong? My end goal is for each line of the JSON to become a new flow file, i.e. file 1 is {"fruit":"apple",..., file 2 is {"fruit":"cherry",..., and so on.
example JSON
{"fruit":"apple", "vegetable":"celery", "location":{"country":"nor\\way", "city":"oslo", }, "color":"blue"}
{"fruit":"cherry", "vegetable":"kale", "location":{"country":"france", "city":"calais", }, "color":"green"}
{"fruit":"peach", "vegetable":"peas", "location":{"country":"united\\kingdom", "city":"london", }, "color":"yellow"}
script
import json
import re
import sys

flow_file = sys.stdin.read()
try:
    load = json.loads(flow_file)
    sys.stdout.write(flow_file)
except:
    flow_file_esc = re.sub(r"[(\\)]", "", flow_file)
    for f in flow_file_esc.splitlines():
        sys.stdout.write(str(f))
Can you clean the file first with ReplaceText and then split it with SplitJson, SplitRecord, or ForkRecord?
If you need to combine the two operations and want to script it, you could try ExecuteScript with Jython (since it doesn't look like you're using native CPython libraries). I have some simple examples in my cookbook and my blog.

How to get CSV to work in RevitPythonShell?

Has anyone figured out how to get CSV (or any other package) to work in RevitPythonShell? I've only been able to get Excel from Interop to work.
When I try running csv in RPS, the terminal executes and shows no error or any kind of feedback, and the file is not created either.
This is the basic code I'm trying to run which comes from a tutorial on CSV I believe.
with open('mycsv2.csv', 'w') as f:
    fieldnames = ['column1', 'column2', 'column3']
    thewriter = csv.DictWriter(f, fieldnames=fieldnames)
    thewriter.writeheader()
    for i in range(1, 10):
        thewriter.writerow({'column1': 'one', 'column2': 'two', 'column3': 'three'})
I find CSV much more user-friendly and easier to understand than Interop Excel. I believe I've read it's doable somewhere, but of course I can't find the source now.
All help, tips, or tricks are appreciated.
I can get it to work by supplying the full path name to the open function, so it looks like this (showing the full path to my Documents folder):
import csv

with open(r'C:\Users\callum\Documents\mycsv2.csv', 'w') as f:
    fieldnames = ['column1', 'column2', 'column3']
    thewriter = csv.DictWriter(f, fieldnames=fieldnames)
    thewriter.writeheader()
    for i in range(1, 10):
        thewriter.writerow({'column1': 'one', 'column2': 'two', 'column3': 'three'})
Let me know if that does the trick!

how to read 7z json file in R

I cannot find an answer on how to load a 7z file in R. I can't use this:
s <- system("7z e -o <path> <archive>")
because of error 127. Maybe that's because I'm on Windows? However, the 7z file opens when I click it in Total Commander.
I'm trying something like this:
con <- gzfile(path, 'r')
ff <- readLines(con, encoding = "UTF-8")
h <- fromJSON(ff)
I get this error:
Error: parse error: trailing garbage
7z¼¯' ãSp‹ Ë:ô–¦ÐÐY#4U¶å¿ç’
(right here) ------^
The encoding is clearly off; when I load this file uncompressed, it's fine without specifying the encoding. Moreover, the result is twice as long. I have thousands of 7z files and need to read them one by one in a loop: read, analyze, and move on. Could anyone give me some hints on how to do this effectively?
When uncompressed, it works easily using:
library(jsonlite)
f <- read_json(path, simplifyVector = T)
EDIT
There are many JSON files in one 7z archive. The above error is probably caused by the parser reading the raw bytes of the whole archive. I don't know how to link these files or specify the connection attributes.
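One way to approach this, as a rough sketch rather than a tested answer: call the 7-Zip command-line tool by its full path (which usually avoids error 127 on Windows), extract each archive to a temporary folder, and then read the JSON files with the same read_json call that already works on uncompressed files. The 7-Zip install path and the data folder below are assumptions to adjust:
library(jsonlite)

seven_zip <- "C:/Program Files/7-Zip/7z.exe"     # assumed install location
archives  <- list.files("d:/data", pattern = "\\.7z$", full.names = TRUE)

for (a in archives) {
  out_dir <- tempfile("7z_")                     # extract each archive into its own temp folder
  system2(seven_zip, c("e", a, paste0("-o", out_dir), "-y"))
  json_files <- list.files(out_dir, pattern = "\\.json$", full.names = TRUE, ignore.case = TRUE)
  for (j in json_files) {
    dat <- read_json(j, simplifyVector = TRUE)   # same call that works on the uncompressed files
    # ... analyze dat here ...
  }
  unlink(out_dir, recursive = TRUE)              # clean up before the next archive
}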

Import multiple json files from a directory and attaching the data

I am trying to read multiple JSON files from a working directory in order to convert them into a dataset. I have files text1, text2, text3 in the directory json. Here is the code I wrote:
setwd("Users/Desktop/json")
temp = list.files(pattern="text*.")
myfiles = lapply(temp, read.delim)
library("rjson")
json_file <- "myfiles"
library(jsonlite)
out <- jsonlite::fromJSON(json_file)
out[vapply(out, is.null, logical(1))] <- "none"
data.frame(out, stringsAsFactors = FALSE)[,1:5]
View(out)
I have about 200 files, so I was wondering if there is a way in which all the JSON files can be imported.
Thanks
I think I had a similar problem when working with Twitter data. I had a directory containing separate files for each user name, and I wanted to import/analyze them as a group. This worked for me:
library(rjson)
filenames <- list.files("Users/Desktop/json", pattern="*.json", full.names=TRUE) # this should give you a character vector, with each file name represented by an entry
myJSON <- lapply(filenames, function(x) fromJSON(file=x)) # a list in which each element is one of your original JSON files
If this doesn't work, then I need a bit more information to understand your problem.
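If the end goal is a single dataset rather than a list, a minimal follow-up sketch with jsonlite (assuming each file can be flattened to a data frame with the same columns; the directory path is the one from the question):
library(jsonlite)

filenames <- list.files("Users/Desktop/json", pattern = "\\.json$", full.names = TRUE)
dfs <- lapply(filenames, function(x) as.data.frame(fromJSON(x, flatten = TRUE)))
out <- do.call(rbind, dfs)   # stack all files into one data frame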

Is there any R package to convert PDF to HTML [duplicate]

Is it possible to parse text data from PDF files in R? There does not appear to be a relevant package for such extraction, but has anyone attempted or seen this done in R?
In Python there is PDFMiner, but I would like to keep this analysis all in R if possible.
Any suggestions?
Linux systems have pdftotext, which I had reasonable success with. By default, it creates foo.txt from a given foo.pdf.
That said, the text mining packages may have converters. A quick rseek.org search seems to concur with your crantastic search.
This is a very old thread, but for future reference: the pdftools R package extracts text from PDFs.
A colleague turned me on to this handy open-source tool: http://tabula.nerdpower.org/. Install, upload the PDF, and select the table in the PDF that requires data-ization. Not a direct solution in R, but certainly better than manual labor.
A purely R solution could be:
library('tm')
file <- 'namefile.pdf'
Rpdf <- readPDF(control = list(text = "-layout"))
corpus <- VCorpus(URISource(file),
readerControl = list(reader = Rpdf))
corpus.array <- content(content(corpus)[[1]])
Then you'll have the PDF's lines in a character vector.
install.packages("pdftools")
library(pdftools)
download.file("http://www.nfl.com/liveupdate/gamecenter/56901/DEN_Gamebook.pdf",
"56901.DEN.Gamebook", mode = "wb")
txt <- pdf_text("56901.DEN.Gamebook")
cat(txt[1])
The Tabula PDF table extractor app is built around a command-line application packaged as a Java JAR, tabula-extractor.
The R tabulizer package provides an R wrapper that makes it easy to pass in the path to a PDF file and get data extracted from data tables out.
Tabula will have a good go at guessing where the tables are, but you can also tell it which part of a page to look at by specifying a target area of the page.
Data can be extracted from multiple pages, and a different area can be specified for each page, if required.
For an example use case, see: When Documents Become Databases – Tabulizer R Wrapper for Tabula PDF Table Extractor.
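A minimal sketch of the wrapper in use (the file name and page number are placeholders, and the defaults are left to guess the table areas):
library(tabulizer)

tables <- extract_tables("report.pdf", pages = 1)  # let Tabula guess the table locations on page 1
str(tables[[1]])                                   # typically a character matrix, one per detected table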
I used an external utility to do the conversion and called it from R. All files had a leading table with the desired information.
Set the path to pdftotext.exe and convert each PDF to text:
library(stringr)  # for str_sub()

exeFile <- "C:/Projects/xpdfbin-win-3.04/bin64/pdftotext.exe"

for (i in 1:length(pdfFracList)) {
  fileNumber     <- str_sub(pdfFracList[i], start = 1, end = -5)
  pdfSource      <- paste0(reportDir, "/", fileNumber, ".pdf")
  txtDestination <- paste0(reportDir, "/", fileNumber, ".txt")
  print(paste0("File number ", i, ", processing file ", pdfSource))
  system(paste(exeFile, "-table", pdfSource, txtDestination, sep = " "), wait = TRUE)
}