Impossible to use mmap with a file opened from a zip archive (binary)

I can't find a way to use mmap with a file object opened from a zip archive with ZipFile. The issue seems to be related to the 'r' mode of ZipFile's open method.
Indeed, if I work directly with the uncompressed source file via open and 'rb', I can use mmap with no issue, but with the zipped version opened with ZipFile I get the error: "UnsupportedOperation: fileno"
My code:
with zipfile.ZipFile(r"path\test.zip", "r") as zip:
    with zip.open("myfile", "r") as file:
        s = mmap.mmap(file.fileno(), 0, access=mmap.ACCESS_READ)
=> error: UnsupportedOperation: fileno
I am looking for a way to efficiently search many zip files (large text files, zipped). mmap works very well on my plain-text files but does not seem to work on the zipped versions.
TY
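A workaround sketch, under the assumption that the archive and member names ("test.zip", "myfile") stand in for the real ones: mmap needs a real OS file descriptor, which a compressed ZipExtFile cannot provide, so one option is to extract the member to an ordinary file first and mmap that.

```python
import mmap
import os
import shutil
import tempfile
import zipfile

# Build a small sample archive so the sketch is self-contained;
# "test.zip" and "myfile" are hypothetical stand-ins.
workdir = tempfile.mkdtemp()
archive = os.path.join(workdir, "test.zip")
with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("myfile", "hello mmap world")

# A ZipExtFile has no fileno(), so mmap cannot wrap it directly.
# Workaround: extract the member to a real file, then mmap that file.
extracted = os.path.join(workdir, "myfile.bin")
with zipfile.ZipFile(archive) as zf:
    with zf.open("myfile") as member, open(extracted, "wb") as out:
        shutil.copyfileobj(member, out)

with open(extracted, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        offset = m.find(b"mmap")

print(offset)  # → 6
```

The cost is one decompression pass per archive member; after that, searches run against the uncompressed bytes at mmap speed.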

Related

Why does Google Apps Script UrlFetchApp change a zip file's binary content when downloading it?

I want to download a zip file into Google Drive via Google Apps Script.
I download a sample zip file with the code below and save it into a folder in Google Drive.
const exampleUrl = "https://www.learningcontainer.com/wp-content/uploads/2020/05/sample-zip-file.zip";
var response = UrlFetchApp.fetch(exampleUrl);
var parentFolder = DriveApp.getFolderById('1aba-tnQZxZMN7DN52eAywTU-Xs-eqOf4');
parentFolder.createFile('sample_CT.zip', response.getContentText()); // doesn't work
parentFolder.createFile('sample_C.zip', response.getContent()); // doesn't work
parentFolder.createFile('sample_B.zip', response.getBlob()); // doesn't work
parentFolder.createFile('sample.zip', response); // doesn't work
After downloading it to my machine, I try to unpack it with the unzip utility, but all of the above versions give me the following:
> unzip sample_CT.zip
Archive: sample_CT.zip
End-of-central-directory signature not found. Either this file is not
a zipfile, or it constitutes one disk of a multi-part archive. In the
latter case the central directory and zipfile comment will be found on
the last disk(s) of this archive.
unzip: cannot find zipfile directory in one of sample_CT.zip or
sample_CT.zip.zip, and cannot find sample_CT.zip.ZIP, period.
Below I am comparing the broken zip file (first snippet) and the correct one (second snippet):
broken:
PKu��P
sample.txtUT
�b�^�b�^�b�^ux��E�1R�0���Q�0�Uz. ,
��XK�!��2��V#�6�:
��M�
��#ux�h�ttPkHTѺ�H�b+�:N�>m�����h�`{�c�0�A��(yh���&���{�U~�Y�~�����HA�����k8w�p���6�Ik��k��?k"?OJx��(n벼g�_�tPK[�c�PKu��P[�c�
��sample.txtUT
�b�^�b�^�b�^ux��PKX
correct:
PKu“¥P
sample.txtUT
Çb±^Çb±^Çb±^uxèèE1RÅ0ûœâQÑ0¹Uz. ,
ÎàXKþ!·ÿ2ð‡V#í®6œ:
£èMà
ï´#ux­hð®¸ttPkHTѺòH²b+ª:Nª>mô”Éä’h˜`{úcÌ0ÅAõš(yh®©»&ÊôÏ{ýU~°YÊ~“¾ËòöHA„Äü×÷k8wÏpùö¹6ÕIk»ðk¤ü?k"?OJxºØ(në²¼gª_ötPK[°c¶PKu“¥P[°c¶
´sample.txtUT
Çb±^Çb±^Çb±^uxèèPKX
As you can see in the snippets above, some bytes differ. I have no idea why UrlFetch changes certain bytes when it downloads a zip file.
On top of that, the file is larger after UrlFetch.
It's because the script is converting it to a string. Folder.createFile() accepts a Blob, but the Blob must be its only argument. If it is passed as the second argument, other method signatures like Folder.createFile(name:string, content:string) take precedence and the Blob is converted to a String to match the signature.
parentFolder.createFile(response.getBlob().setName('TheMaster.zip'))
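Once a file has been downloaded to your machine, you can check locally whether it is a structurally valid archive. A minimal stdlib-only Python sketch (the file names are hypothetical stand-ins for the downloads above):

```python
import os
import tempfile
import zipfile

# zipfile.is_zipfile looks for the end-of-central-directory record --
# the very structure unzip reported missing above.
def check(path):
    if not zipfile.is_zipfile(path):
        return "corrupt or not a zip"
    with zipfile.ZipFile(path) as zf:
        bad = zf.testzip()  # CRC-checks every member; returns first bad name
        return "ok" if bad is None else "bad member: " + bad

workdir = tempfile.mkdtemp()
good = os.path.join(workdir, "sample.zip")
with zipfile.ZipFile(good, "w") as zf:
    zf.writestr("sample.txt", "hello")

# Stand-in for a download that was mangled by a string conversion.
mangled = os.path.join(workdir, "sample_CT.zip")
with open(mangled, "w") as f:
    f.write("this is a zip that was converted to text")

print(check(good), "/", check(mangled))  # → ok / corrupt or not a zip
```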

Autodesk Forge download object, but cannot tell if it is a Revit model or zip file

I was downloading Revit models from a BIM 360 team hub via the Forge API using the following URI:
https://developer.api.autodesk.com/oss/v2/buckets/:bucketKey/objects/:objectName
All my objectName values ended with .rvt, so I downloaded and saved them as .rvt files.
However, I noticed that some of the files cannot be opened by Revit. They are actually not .rvt files but zip files, so I have to change the extension to .zip and unzip the file to get the real .rvt files.
My problem is that not all of the files are zip files, and I cannot tell from the API, because the URI I request always ends with .rvt.
Every Unix OS provides the file command, a standard utility program for recognising the type of data contained in a computer file:
https://en.wikipedia.org/wiki/File_(command)
A zip file is directly recognised and reported like this:
$ file test_dataset.zip
test_dataset.zip: Zip archive data, at least v2.0 to extract
A Revit RVT model is a Windows compound document file, so it generates the following output:
$ file little_house_2021.rvt
little_house_2021.rvt: Composite Document File V2 Document, Cannot read section info
Hence you can use the same algorithm as file does to distinguish between RVT and ZIP files.
AFAIK, file just looks at the first few bytes (the magic number) of the given file.
The Python programming language offers similar utilities; try an Internet search for "distinguish file type python". The first hits explain how to check the type of files without extensions in Python and point to the filetype Python project.
Other programming languages can provide similar functionality.
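The magic-number check described above can be done directly in a few lines of stdlib Python, with no external tool. This is a sketch under the stated assumptions about the two formats: a ZIP archive begins with PK\x03\x04, and a Compound Document (the container .rvt uses) begins with D0 CF 11 E0 A1 B1 1A E1.

```python
import os
import tempfile
import zipfile

ZIP_MAGIC = b"PK\x03\x04"
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"

def sniff(path):
    """Return 'zip', 'rvt', or 'unknown' from the file's first bytes."""
    with open(path, "rb") as f:
        head = f.read(8)
    if head.startswith(ZIP_MAGIC):
        return "zip"
    if head.startswith(OLE2_MAGIC):
        return "rvt"  # Compound Document; assumed to be a Revit model here
    return "unknown"

# Quick self-check: a freshly written archive saved under a .rvt name
# is still detected as a zip, regardless of its extension.
tmp = os.path.join(tempfile.mkdtemp(), "downloaded.rvt")
with zipfile.ZipFile(tmp, "w") as zf:
    zf.writestr("model.rvt", "payload")
kind = sniff(tmp)
print(kind)  # → zip
```

Reading only the first eight bytes keeps the check cheap even for multi-gigabyte downloads.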

How should one zip a large folder in Windows 10, upload it to GDrive, then unzip it?

I have a directory consisting of 22 sub-directories. Altogether, the directory is about 750GB in size and I need this data on GDrive so that I can work with it in Google Colab. Obviously uploading this takes an absolute age (particularly with my slow connection) so I would like to zip it, upload it, then unzip it in the cloud.
I am using 7zip and zipping each subdirectory using the zip format and "normal" compression level. (EDIT: Can now confirm that I get the same error for 7z and tar format). Each subdirectory ends up between 14 and 20GB in size. I then upload this and attempt to unzip it in Google Colab using the following code:
drive.mount('/content/gdrive/')
!apt-get install p7zip-full
!7za x "/content/gdrive/My Drive/av_tfrecords/drumming_7zip.zip" -o"/content/gdrive/My Drive/unzipped_av_tfrecords/" -aos
This extracts some portion of the zip file before throwing an error. There are a variety of errors and sometimes the code will not even begin unzipping the file before throwing an error. This is the most common error:
Can not open the file as archive
ERROR: Unknown error -2147024891
Archives with Errors: 1
If I then attempt to rerun the !7za command, it may extract one or two more files from the zip file before throwing this error:
terminate called after throwing an instance of 'CInBufferException'
It may also complain about particular files within the zip archive:
ERROR: Headers Error : drumming/yt-g0fi0iLRJCE_23.tfrecords
I have also tried using:
!unzip -n "/content/gdrive/My Drive/av_tfrecords/drumming_7zip.zip" -d "/content/gdrive/My Drive/unzipped_av_tfrecords/"
But that just begins throwing errors:
file #254: bad zipfile offset (lseek): 8137146368
file #255: bad zipfile offset (lseek): 8168710144
file #256: bad zipfile offset (lseek): 8207515648
Although I would prefer a solution in Colab, I have also tried an app available in GDrive named "Zip Extractor", but that too throws an error and has a data quota.
This has now happened across 4 zip files, and each time I try something new it takes a long time because of the upload speeds. Any explanations for why this is happening and how I can resolve the issue would be greatly appreciated. I also understand there are probably alternatives to what I am trying to do; those would be appreciated too, even if they do not directly answer the question. Thank you!
I had the same problem and solved it with:
new ProcessBuilder(new String[] {"7z", "x", fPath, "-o" + dir})
Pass the command as an array of arguments, not as a single command-line string.
Good luck!
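The answer above is Java, but the same point carries over to Python, which may be more convenient inside Colab: pass the command to subprocess as a list of arguments rather than one string, so paths containing spaces (like "My Drive") survive without shell quoting. A sketch, with the paths taken from the question:

```python
import subprocess

# Each argument is its own list element; no shell, no quoting problems.
cmd = [
    "7za", "x",
    "/content/gdrive/My Drive/av_tfrecords/drumming_7zip.zip",
    "-o/content/gdrive/My Drive/unzipped_av_tfrecords/",
    "-aos",  # skip members that were already extracted on earlier attempts
]
# subprocess.run(cmd, check=True)  # uncomment to actually run the extraction
print(len(cmd))
```

With a list, subprocess bypasses the shell entirely, so the embedded spaces in the Drive paths are passed through verbatim.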

'b' Added to file name when trying to load data file in Jupyter

When trying to load a data file into a Jupyter notebook I get the following error message
File b'data_file.csv' does not exist: b'data_file.csv'
Following suggestions I found online for this problem, I tried the following variations, including specifying the full path and UTF-8 encoding:
pd.read_csv("data_file.csv")
pd.read_csv("C:\\FULL_PATH\\EBI\\data_file.csv")
pd.read_csv(r"data_file.csv")
pd.read_csv(r"C:\\FULL_PATH\\EBI\\data_file.csv")
pd.read_csv("data_file.csv",encoding='utf-8')
pd.read_csv("C:\\FULL_PATH\\EBI\\data_file.csv",encoding='utf-8')
pd.read_csv(r"data_file.csv",encoding='utf-8')
pd.read_csv(r"C:\\FULL_PATH\\EBI\\data_file.csv",encoding='utf-8')
as well as
pd.read_csv('C:\\FULL_PATH\\EBI\\"data_file.csv"')
However, all of these yield the same error message
File b'data_file.csv' does not exist: b'data_file.csv'
Not sure if it is helpful to add that the Jupyter notebook is being run on a Windows Server 2012 platform. Please note that I checked using os.getcwd() that the full path is indeed as quoted above.
Any suggestions would be much appreciated!
Assuming the file is in your working directory, could you try:
import os
import pandas as pd

file = os.path.join(os.getcwd(), "data_file.csv")
df = pd.read_csv(file)
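If that still fails, it can help to confirm what the process actually sees before calling pandas: the "File b'...' does not exist" message means the OS cannot find the path, so check the same things the OS checks. A stdlib-only sketch (the scratch directory and CSV contents are made up so it runs anywhere):

```python
import os
import tempfile

# Work in a scratch directory and create the file so the sketch is
# self-contained; in the notebook you would skip these two steps.
os.chdir(tempfile.mkdtemp())
with open("data_file.csv", "w") as f:
    f.write("a,b\n1,2\n")

path = os.path.join(os.getcwd(), "data_file.csv")
exists = os.path.exists(path)
print(os.getcwd())   # the directory the notebook is really running in
print(exists)        # False would mean a wrong directory or a misspelled name
print(os.listdir.__name__)  # os.listdir(".") reveals near-misses like "data_file.csv.txt"
```

A common culprit on Windows is a hidden extension, e.g. the file is really data_file.csv.txt; listing the directory exposes that immediately.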

LOAD CSV command keeps using old file: location, ignores command input

I am using Community edition 3.0.5 on Windows 10. I made multiple attempts to execute a LOAD CSV command before being told that such files cannot reside on an external drive. When I moved the file to users/user/ and tried to execute the LOAD CSV command, I got the same message, "Couldn't load the external resource at: file:/F:/Neo4j%20DBs/Data.gov%20Consumer%20Complaints/Consumer%20Complaints%20DB/import/Users/CharlieOh/Consumer_Complaints.csv", despite the fact that the command I entered was
"LOAD CSV WITH HEADERS FROM
'file:///Users/CharlieOh/Consumer_Complaints.csv' AS line
WITH line
LIMIT 1
RETURN line"
I tried to locate the file neo4j.conf and could only find C:\Program Files (x86)\Neo4j Community 3.2.2\Neo4j Community.install4j\i4jparams.conf. I even deleted the old DB, recreated the small amount of data, and got the same error, which seems to indicate that the LOAD CSV function is totally useless across all my Neo4j databases. BTW, the %20 in the file specification was due to suggestions on Stack Overflow, as was using underscores to avoid blank spaces in the file specification. None of it worked, and now that I believe I may have solved the problem by putting the csv file in the user directory, the LOAD CSV function won't let me do it. One last thing: I am following the YouTube video https://www.youtube.com/watch?v=Eh_79goBRUk to learn how to load a csv file into Neo4j.
The csv file needs to go in the import directory of the specific database. With Neo4j Desktop this is easy to identify by clicking on the Manage button of the database and then the open folder button. It looks like you've found it.
Once the database import directory is located, you specify the file in LOAD CSV with the statement LOAD CSV WITH HEADERS FROM 'file:///FN', where FN is your file name, including the .csv extension. You do NOT use the full path; the import directory is assumed.