How to read a 7z JSON file in R - json

I can't find an answer on how to load a 7z file in R. I can't use this:
s <- system("7z e -o <path> <archive>")
because it fails with error 127. Maybe that's because I'm on Windows? However, the 7z file opens fine when I click it in Total Commander.
I'm trying something like this:
con <- gzfile(path, 'r')
ff <- readLines(con, encoding = "UTF-8")
h <- fromJSON(ff)
I get this error:
Error: parse error: trailing garbage
7z¼¯' ãSp‹ Ë:ô–¦ÐÐY#4U¶å¿ç’
(right here) ------^
The encoding is completely off; when I load the same file uncompressed it works fine without specifying any encoding. Moreover, the result is about twice as long. I have thousands of 7z files and need to read them one by one in a loop: read, analyze, and move on. Could anyone give me some hints on how to do this efficiently?
When uncompressed, it works easily using:
library(jsonlite)
f <- read_json(path, simplifyVector = T)
EDIT
There are many JSON files inside one 7z archive. The above error is probably caused by the parser reading the raw bytes of the whole archive. I don't know how to get at the individual files or how to specify the connection attributes.
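For what it's worth, gzfile() only understands gzip-style compression, not 7-Zip archives, and an exit status of 127 from system() typically means the 7z executable was not found on the PATH. A minimal sketch of one possible loop, assuming 7-Zip is installed in its default Windows location and that the folder and file names below are placeholders:
library(jsonlite)
# assume 7z.exe lives at its default install location (adjust as needed)
sevenzip <- "C:/Program Files/7-Zip/7z.exe"
archives <- list.files("path/to/archives", pattern = "\\.7z$", full.names = TRUE)
for (a in archives) {
  tmp <- tempfile("extract")
  dir.create(tmp)
  # extract everything from the archive into a temporary folder
  system2(sevenzip, c("e", shQuote(a), paste0("-o", tmp), "-y"))
  for (j in list.files(tmp, pattern = "\\.json$", full.names = TRUE)) {
    h <- read_json(j, simplifyVector = TRUE)
    # ... analyze h ...
  }
  unlink(tmp, recursive = TRUE)  # clean up before the next archive
}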

Related

stream_in part of .json file

I have a large .json file and I only want to read in a part of it.
I tried the following solutions, but they didn't work:
yelp <- stream_in(file("yelp_academic_dataset_review.json"), paigesize = 500)
yelp <- stream_in(file("yelp_academic_dataset_review.json"), nrows = 500)
Anyone know how it works?
First off: it's always helpful to state the packages you are using, in your case jsonlite.
One solution is parsing the data file (as a .txt file) prior to streaming it in.
yelp <- readLines("yelp_academic_dataset_review.json")[1:500]
yelp <- stream_in(textConnection(gsub("\\n", "", yelp)))
I'm assuming your file is local?
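If the file is too large to pull entirely into memory with readLines() before subsetting, a minimal variant of the same idea (assuming the same local file) reads only the first 500 lines through a connection:
library(jsonlite)
con <- file("yelp_academic_dataset_review.json", open = "r")
first500 <- readLines(con, n = 500)  # only the first 500 lines are read from disk
close(con)
yelp <- stream_in(textConnection(first500), verbose = FALSE)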
I have had success with actual piping/streaming of JSON in the past, i.e., from the command line:
cat x.json | parse_json.py
Then you write your python script:
import json, sys

for line in sys.stdin:
    try:
        js_line = json.loads(line.rstrip())
        # do something with js_line['x']['y']
    except ValueError:
        pass
I'm not sure why you want to use stream_in, but this somewhat manual approach can be effective.
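To stay entirely in R, a hedged sketch of a similar chunked approach using stream_in()'s handler and pagesize arguments; the stars filter below is purely hypothetical:
library(jsonlite)
results <- list()
# handler is called once per page of `pagesize` parsed records,
# so the full file never has to sit in memory as one object
stream_in(file("yelp_academic_dataset_review.json"),
          handler = function(df) {
            keep <- df[df$stars >= 4, , drop = FALSE]  # hypothetical filter
            results[[length(results) + 1]] <<- keep
          },
          pagesize = 500, verbose = FALSE)
yelp <- do.call(rbind, results)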
I use this code for extracting lines 1,400,001 to 1,450,000 of the yelp data:
setwd("d:/yelp_dataset")
rm(list=ls())
library(jsonlite)
rev<- 'd:/yelp_dataset/review.JSON'
revu <- jsonlite::stream_in(textConnection(readLines(rev)[1400001:1450000]), verbose = FALSE)

How to check encoding of a CSV file

I have a CSV file and I wish to know its encoding. Is there a menu option in Microsoft Excel that can help me detect it, or do I need to use a programming language like C# or PHP to deduce it?
You can use Notepad++ to evaluate a file's encoding without needing to write code. The evaluated encoding of the open file will display on the bottom bar, far right side. The encodings supported can be seen by going to Settings -> Preferences -> New Document/Default Directory and looking in the drop down.
On Linux systems, you can use the file command. It will give you the correct encoding.
Sample:
file blah.csv
Output:
blah.csv: ISO-8859 text, with very long lines
If you use Python, you can print the open file object to see which encoding Python opened the file with (the platform/locale default). For example:
with open('file_name.csv') as f:
    print(f)
The output is something like this:
<_io.TextIOWrapper name='file_name.csv' mode='r' encoding='utf8'>
You can also use the Python chardet library:
# install the chardet library
!pip install chardet
# import the chardet library
import chardet
# use the detect method to find the encoding
# 'rb' means read in the file as binary
with open("test.csv", 'rb') as file:
    print(chardet.detect(file.read()))
Use chardet https://github.com/chardet/chardet (documentation is short and easy to read).
Install Python, run pip install chardet, and finally use the chardetect command-line tool it installs.
I tested it on GB2312 text and it's pretty accurate. (Make sure you have at least a few characters; a sample with only one character can easily fail.)
file is not reliable as you can see.
Or you can run this in a Python console or in a Jupyter notebook:
import csv
data = open("file.csv","r")
data
You will see information about the data object like this:
<_io.TextIOWrapper name='arch.csv' mode='r' encoding='cp1250'>
As you can see, it contains encoding information.
CSV files have no headers indicating the encoding.
You can only guess by looking at:
the platform / application the file was created on
the bytes in the file
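As a small illustration of "looking at the bytes", a hedged R snippet that checks whether a file starts with a UTF-8 byte order mark (the file name is a placeholder):
# read the first three bytes and compare them with the UTF-8 BOM (EF BB BF)
bom <- readBin("blah.csv", what = "raw", n = 3)
identical(bom, as.raw(c(0xEF, 0xBB, 0xBF)))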
In 2021, emoticons are widely used, but many import tools fail to import them. The chardet library is often recommended in the answers above, but the lib does not handle emoticons well.
icecream = '🍦'
import csv
with open('test.csv', 'w') as f:
    wf = csv.writer(f)
    wf.writerow(['ice cream', icecream])
import chardet
with open('test.csv', 'rb') as f:
    print(chardet.detect(f.read()))
{'encoding': 'Windows-1254', 'confidence': 0.3864823918622268, 'language': 'Turkish'}
Trying to read the file with that detected encoding raises a UnicodeDecodeError.
The default encoding on macOS is UTF-8. It's passed explicitly here even though that isn't necessary there, but on Windows it might be.
with open('test.csv', 'r', encoding='utf-8') as f:
    print(f.read())
ice cream,🍦
The file command also picks this up:
file test.csv
test.csv: UTF-8 Unicode text, with CRLF line terminators
My advice in 2021, if the automatic detection goes wrong: try UTF-8 before resorting to chardet.
In Python, you can try each known encoding alias until one succeeds:
from encodings.aliases import aliases
import pandas as pd

alias_values = set(aliases.values())
for encoding in alias_values:
    try:
        df = pd.read_csv("test.csv", encoding=encoding)
        print('successful', encoding)
    except Exception:
        pass
As mentioned by #3724913 (Jitender Kumar), you can use the file command (it also works in WSL on Windows). On my system, file blah.csv would not show the encoding info, but after reading man file I was able to get it by running file --exclude encoding blah.csv.
import os

import chardet
import pandas as pd

def read_csv(path: str, size: float = 0.10) -> pd.DataFrame:
    """
    Reads the CSV file located at path and returns it as a Pandas DataFrame.

    Args:
        path (str): The path to the CSV file.
        size (float): The fraction of the file to be used for detecting the
            encoding. Defaults to 0.10.

    Returns:
        pd.DataFrame: The CSV file as a Pandas DataFrame.

    Raises:
        UnicodeError: If the encoding cannot be detected with the initial
            sample size, the function retries with a larger size (increased
            by 0.20) until the encoding can be detected or an error is raised.
    """
    try:
        byte_size = int(os.path.getsize(path) * size)
        with open(path, "rb") as rawdata:
            result = chardet.detect(rawdata.read(byte_size))
        return pd.read_csv(path, encoding=result["encoding"])
    except UnicodeError:
        return read_csv(path=path, size=size + 0.20)
I just added a function that finds the correct encoding and reads the CSV at the given file path. I thought it would be useful.
Just add the encoding argument that matches the file you're trying to read.
open('example.csv', encoding='UTF8')
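Since the rest of this collection is R-focused, here is a minimal R sketch as well, assuming the readr package is installed; guess_encoding() wraps stringi's encoding detector and returns candidate encodings with confidence scores:
library(readr)
# list likely encodings with confidence scores
guess_encoding("blah.csv")
# then read the file with the top guess, for example:
df <- read_csv("blah.csv", locale = locale(encoding = "ISO-8859-1"))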

R json, incomplete final line found

My problem: R json, "incomplete final line found"
My effort: I followed 'Incomplete final line' warning when trying to read a .csv file into R
I used this site to check my file's validity. It is data from my Facebook news feed collected using the Graph API.
My code:
library("rjson")
work<-"C:/ContainingFolder/"
json_data <- fromJSON(paste(readLines(paste0(work,"SunwayFB.txt")), collapse=""))
My error:
Warning message:
In readLines(paste0(work, "SunwayFB.txt")) :
incomplete final line found on 'C:/ContainingFolder/SunwayFB.txt'
It works without any errors if you read the file with fromJSON instead of readLines.
fp <- file.path(work, "SunwayFB.txt")
json_data <- fromJSON(file = fp)
By the way: for the readLines approach, you would have to add a newline at the end of the file.
You can ignore the warning message.
readLines(paste0(work,"SunwayFB.txt"))
Add the warn argument:
readLines(paste0(work,"SunwayFB.txt"), warn=FALSE)
In most cases, "incomplete final line" warnings can be averted by appending a newline to the file you are trying to open. Just go to the end of the file, press Enter, save the file, and re-run whatever command you are using to load it in R; it should no longer show the warning.
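If editing files by hand is impractical (for example with many files), a minimal sketch that appends the newline from R itself, reusing the work path and file name from the question:
fp <- file.path(work, "SunwayFB.txt")
cat("\n", file = fp, append = TRUE)  # add a trailing newline so readLines() stops warning
json_data <- fromJSON(paste(readLines(fp), collapse = ""))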

Is there any R package to convert PDF to HTML [duplicate]

Is it possible to parse text data from PDF files in R? There does not appear to be a relevant package for such extraction, but has anyone attempted or seen this done in R?
In Python there is PDFMiner, but I would like to keep this analysis all in R if possible.
Any suggestions?
Linux systems have pdftotext, which I had reasonable success with. By default, it creates foo.txt from a given foo.pdf.
That said, the text mining packages may have converters. A quick rseek.org search seems to concur with your crantastic search.
This is a very old thread, but for future reference: the pdftools R package extracts text from PDFs.
A colleague turned me on to this handy open-source tool: http://tabula.nerdpower.org/. Install, upload the PDF, and select the table in the PDF that requires data-ization. Not a direct solution in R, but certainly better than manual labor.
A purely R solution could be:
library('tm')
file <- 'namefile.pdf'
Rpdf <- readPDF(control = list(text = "-layout"))
corpus <- VCorpus(URISource(file),
readerControl = list(reader = Rpdf))
corpus.array <- content(content(corpus)[[1]])
then you'll have pdf lines in an array.
install.packages("pdftools")
library(pdftools)
download.file("http://www.nfl.com/liveupdate/gamecenter/56901/DEN_Gamebook.pdf",
"56901.DEN.Gamebook", mode = "wb")
txt <- pdf_text("56901.DEN.Gamebook")
cat(txt[1])
The Tabula PDF table extractor app is built around a command-line application based on a Java JAR package, tabula-extractor.
The R tabulizer package provides an R wrapper that makes it easy to pass in the path to a PDF file and get data extracted from data tables out.
Tabula will have a good go at guessing where the tables are, but you can also tell it which part of a page to look at by specifying a target area of the page.
Data can be extracted from multiple pages, and a different area can be specified for each page, if required.
For an example use case, see: When Documents Become Databases – Tabulizer R Wrapper for Tabula PDF Table Extractor.
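A minimal sketch of that workflow, assuming tabulizer and its Java dependency are installed and using a placeholder file name:
library(tabulizer)
# let Tabula guess where the tables are; returns a list, one element per table
tables <- extract_tables("report.pdf")
# or restrict extraction to page 2 and a specific region of that page;
# area is c(top, left, bottom, right) in points, and guess must be FALSE
tables_p2 <- extract_tables("report.pdf", pages = 2,
                            area = list(c(100, 50, 400, 550)),
                            guess = FALSE)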
I used an external utility to do the conversion and called it from R. All files had a leading table with the desired information.
Set the path to pdftotext.exe and convert the PDF to text:
library(stringr)  # provides str_sub
exeFile <- "C:/Projects/xpdfbin-win-3.04/bin64/pdftotext.exe"
for (i in 1:length(pdfFracList)) {
  fileNumber <- str_sub(pdfFracList[i], start = 1, end = -5)
  pdfSource <- paste0(reportDir, "/", fileNumber, ".pdf")
  txtDestination <- paste0(reportDir, "/", fileNumber, ".txt")
  print(paste0("File number ", i, ", processing file ", pdfSource))
  system(paste(exeFile, "-table", pdfSource, txtDestination, sep = " "), wait = TRUE)
}

Parsing multiple files for HTML tables and appending to single file in R

This is a small project in R that I am attempting to execute. I've scraped a few hundred HTML pages. I am able to use the readHTMLTable function in the XML library to read the tables that I'm interested in. However, I'm having trouble writing the for loop to go through the directory, grab the table from each file, and append them to a single CSV file.
I HAVE been successful in looping through the files and saving each table to its own txt file (which I feel is at least a start):
library(XML)  # htmlTreeParse / readHTMLTable
parentpath <- "Z:/scraping"
setwd(parentpath)
filenames <- list.files()
for (targetfile in filenames) {
  setwd(parentpath)
  data <- readHTMLTable(targetfile)
  outputfile <- paste(targetfile, '.txt', sep = "")
  write.table(data[6], file = outputfile, sep = "\t", quote = TRUE)
}
Shouldn't the append=TRUE option in write.table do the trick for you? You can read about it by looking up ?write.table.
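A minimal sketch of that suggestion, assuming the sixth table has the same columns in every file and reusing the paths from the question:
library(XML)
parentpath <- "Z:/scraping"
filenames <- list.files(parentpath, full.names = TRUE)
outputfile <- file.path(parentpath, "all_tables.csv")
first <- TRUE
for (targetfile in filenames) {
  tbl <- readHTMLTable(targetfile)[[6]]
  # write the header only for the first file, then append the remaining tables
  write.table(tbl, file = outputfile, sep = ",", quote = TRUE,
              row.names = FALSE, col.names = first, append = !first)
  first <- FALSE
}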