When I try to create a project, I load a CSV file with 3.5 million rows (400 MB) and Refine doesn't upload it.
It indicates 100% 1037 MB.
I opened refine.ini and raised the memory limit, but there is no result.
# NOTE: This file is not read if you run the Refine executable directly
# It is only read if you use the refine shell script or refine.bat
no_proxy="localhost,127.0.0.1"
#REFINE_PORT=3334
#REFINE_HOST=127.0.0.1
#REFINE_WEBAPP=main\webapp
# Memory and max form size allocations
#REFINE_MAX_FORM_CONTENT_SIZE=104857600
REFINE_MEMORY=100000M
# Set initial java heap space (default: 256M) for better performance with large datasets
REFINE_MIN_MEMORY=100000M
# Some sample configurations. These have no defaults.
#ANT_HOME=C:\grefine\tools\apache-ant-1.8.1
#JAVA_HOME=C:\Program Files\Java\jdk1.6.0_25
#JAVA_OPTIONS=-XX:+UseParallelGC -verbose:gc -Drefine.headless=true
#JAVA_OPTIONS=-Drefine.data_dir=C:\Users\user\AppData\Roaming\OpenRefine
# Uncomment to increase autosave period to 60 mins (default: 5 minutes) for better performance of long-lasting transformations
#REFINE_AUTOSAVE_PERIOD=60
What should I do?
Based on the testing I did and published at https://groups.google.com/d/msg/openrefine/-loChQe4CNg/eroRAq9_BwAJ, to process 3.5 million rows you probably need to allocate around 8 GB of RAM to have a reasonably responsive project.
As documented in 'OpenRefine changing the port and host when executable is run directly', when running OpenRefine on Windows, where you set the options depends on whether you start OpenRefine via the exe file or the bat file.
To allocate over 4 GB of RAM, you definitely need to be using a 64-bit Java version - please check which version of Java OpenRefine is running in (it will use the Java specified in JAVA_HOME). However, you may find issues allocating 4 GB on 32-bit Java on a 64-bit OS (see 'Maximum Java heap size of a 32-bit JVM on a 64-bit OS').
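For illustration, a minimal check and a more realistic setting (the 8192M value is just an example based on the ~8 GB estimate above, not an official recommendation): from a command prompt, run

java -version

and confirm the output mentions a 64-Bit VM; then in refine.ini set something the machine can actually provide, e.g.

REFINE_MEMORY=8192M
REFINE_MIN_MEMORY=8192M

rather than 100000M, which asks for roughly 100 GB of heap and will fail on most machines.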
Related
I run packages in debug mode in Visual Studio, but often get 'low on virtual memory' warnings.
I also often run multiple instances of Visual Studio so that multiple packages can run for different purposes simultaneously.
The machine has 64 gig of memory, and more than 30 gig is free at the time I start getting the following in the output window. Any ideas? (I've tried it with larger values for DefaultBufferMaxRows and corresponding increases in DefaultBufferSize)
Information: 0x4004800C at Data Flow Task, SSIS.Pipeline: The buffer manager detected that the system was low on virtual memory, but was unable to swap out any buffers. 32 buffers were considered and 32 were locked. Either not enough memory is available to the pipeline because not enough is installed, other processes are using it, or too many buffers are locked.
Information: 0x4004800F at Data Flow Task: Buffer manager allocated 40 megabyte(s) in 4 physical buffer(s).
Does MySQL use fread, read, mmap, or something else when saving database data to the disk on a Linux OS? Or does MySQL run a test to decide which one to use? This is not in reference to saving config data. I'm interested in the actual database, preferably InnoDB.
Thanks for any help.
Edit: To be more specific, I'm interested in the C/C++ source code in MySQL that makes the actual calls that save data to an InnoDB database. Possible options are fread, read, mmap, among others.
What file system does MySQL use?
The MySQL access method code (InnoDB, MyISAM, AriaDB and the rest) uses the native file system of the host volume on the host operating system. NTFS on Windows, ext4fs on U**X systems, etc. The competent platform ports use a variety of I/O techniques including memory mapping, scatter/gather and ordinary read and write system calls, and integrate with the file systems' journaling features. The exact techniques used depend on the kind of query, the access method, and the state of caches.
Pro tip: Don't worry about this for performance reasons unless your server is running on an old 32-bit 486 machine you found in a storeroom (or unless you have millions of users and billions of rows of data).
On Linux systems all POSIX filesystems will work. fread is a libc construct that will translate to underlying syscalls like read, mmap, write, etc.
The read, mmap, write operations are implemented in a Linux VFS (virtual file system) layer before those map to specific operations in the filesystem code. So any POSIX filesystem will work with MySQL.
The only filesystem test I've seen in the MySQL code is for the fallocate syscall, which isn't implemented on all filesystems (especially when the test was first added; it's probably widely supported now). There is an implementation workaround for when fallocate isn't available.
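If you want to see for yourself which calls a particular build uses, one rough way (my own suggestion, not something from the MySQL docs) is to attach strace to the running server on Linux:

# attach to the running mysqld and watch only the I/O-related syscalls
sudo strace -f -e trace=read,write,pread64,pwrite64,mmap,fsync,fallocate -p $(pidof mysqld)

On a typical InnoDB workload you should mostly see pread64/pwrite64 and fsync against the .ibd and log files; the exact mix depends on settings such as innodb_flush_method.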
I have installed a SpringMVC Web application with JPA and a Mysql Database.
The application is displaying statistics from the database (with a lot of selects)
It works quite fast on Linux (MySQL 5.5.54), but it is very slow on Windows 10 (MySQL 5.6.38).
Do you know what could cause such a behaviour on Windows?
Or could you give me hints or tell me where to search?
[UPDATE]
Linux: Intel® Core™ i7-4510U CPU @ 2.00GHz × 4 / 8 GB RAM
Windows: Intel Xeon CPU E31220 @ 3.1GHz / 4 GB RAM
I know that the Windows machine is not as "powerful" as the Linux one. I wonder if increasing the memory would be enough, or does MySQL need a lot of CPU too?
My list would be:
Check configs are identical - not just the settings in my.ini - values not set there are set at compile time, and the 2 instances have definitely been compiled separately! You'll need to capture and compare the output of SHOW VARIABLES (see the sketch after this list)
Check file deployment is similar - whether innodb is configured to use one file per table, whether the files are distributed across multiple disks
Check adequate memory available for caching on MSWindows
disable anti-virus
Make sure MSWindows is configured as a server (prioritize background tasks)
Windows sucks, deal with it :)
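A rough sketch of the SHOW VARIABLES comparison from the first point, assuming the mysql command-line client is available on both machines (file names are placeholders):

# on the Linux server
mysql -u root -p -e "SHOW VARIABLES" > linux_variables.txt
# on the Windows server
mysql -u root -p -e "SHOW VARIABLES" > windows_variables.txt
# copy the files side by side and compare
diff linux_variables.txt windows_variables.txt

Bear in mind the two servers run different versions (5.5 vs 5.6), so some differences are expected; the ones to look at first are buffer and cache sizes such as innodb_buffer_pool_size.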
We are having a problem where our cloudSQL database crashes periodically.
The error we are seeing in the logs is:
[ERROR] InnoDB: Write to file ./ib_logfile1 failed at offset 237496832, 1024 bytes should have been written, only 0 were written. Operating system error number 12. Check that your OS and file system support files of this size. Check also that the disk is not full or a disk quota exceeded.
From what I understand, error number 12 means 'Cannot allocate memory'. Is there a way we can configure cloudsql to leave a larger buffer of free memory? The alternative would be to upgrade to have more memory, but from what I understand cloudSQL automatically uses all the memory available to it... Is this likely to reduce the problem or would it likely continue in the same way?
Are there any other things we can do to reduce this issue?
It is possible your system is running out of disk space rather than memory, especially if you are running in an HA config.
(If disk isn't the issue, you should file a GCP support ticket rather than asking here.)
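One way to check, assuming you have the gcloud CLI set up (INSTANCE_NAME is a placeholder):

# show the instance configuration, including the provisioned disk size (settings.dataDiskSizeGb)
gcloud sql instances describe INSTANCE_NAME
# optionally let Cloud SQL grow the disk automatically instead of failing when it fills up
gcloud sql instances patch INSTANCE_NAME --storage-auto-increase

If the provisioned disk really is close to full, increasing it (or enabling auto-increase) is the direct fix.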
I've created a 10 GB HDD and 3.75 GB RAM instance in Google Cloud and hosted a quite heavy DB transaction application's backend/API there. The OS is Ubuntu 14.04 LTS and I'm using the Apache web server with PHP and MySQL for the backend. The problem is that the HDD has almost run out of space very quickly.
Using Linux (Ubuntu) commands, I've found that my source code (/var/www/html) is about 200 MB and the MySQL data folder (/var/lib/mysql) is 3.7 GB (around 20,000,000 records in my project DB). I'm confused how the rest of my HDD space is occupied (apart from OS files). As of today, I only have 35 MB left. Once, for testing purposes, I copied the source code to another folder; even then I had the same problem. When I realized that my HDD space was running out, I deleted that folder and freed around 200 MB. But later (around 10 minutes) that freed space was gone again!
I figured that some log file, like the Apache error log, access log, MySQL error log or CakePHP debug log, might be occupying that space, but I disabled and truncated those files long ago and checked whether they are being created again - they aren't. So how is the space being used?
I'm seriously worried about continuing this project on this instance. I thought about adding an additional HDD to remedy the situation, but I need to understand how my HDD space is being occupied first. Any help will be highly appreciated.
You can start by searching for the largest files on your system.
From the / directory, run:
sudo find . -type f -size +5000k -exec ls -lh {} \;
Once you find the files you can start to troubleshoot.
If you get too many files, you can increase +5000k to target only the larger files.
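A complementary approach (my own suggestion, not part of the answer above) is to look at usage per directory and at deleted-but-still-open files, which is a common reason freed space seems to vanish:

# disk usage per directory on the root filesystem, largest first
sudo du -xh / --max-depth=2 2>/dev/null | sort -rh | head -20
# files that have been deleted but are still held open by a process
sudo lsof +L1

If /var/lib/mysql keeps growing, also check for binary logs or general/slow query logs inside that directory.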