Sample-gene expression data storage problem in MySQL

I have 50 GB of sample-gene expression data that I want to store in MySQL. The data is split across three text files: one for samples, one for genes, and a third containing a sample-gene matrix holding the expression values.
I set up three tables: one for samples, one for genes, and a third with two foreign keys (sample_id, gene_id) and an exp_value field. My problem is how to get the matrix into that third table.

Please read
https://dev.mysql.com/doc/refman/8.0/en/load-data.html
You have the data in text files; with luck it is already formatted with separators. If it is, importing is easy.
On Linux use a terminal such as Konsole; on Windows use CMD. The import will take a while at that file size, so be patient. Expect a lot of trial and error at first.
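Assuming the matrix file has sample IDs in the header row and gene IDs in the first column (that layout is a guess about your file), you can stream the matrix into (sample_id, gene_id, exp_value) triples that match your third table, then bulk-load the result with LOAD DATA. A minimal Python sketch:

```python
import csv
import io

def matrix_to_triples(matrix_text, delimiter="\t"):
    """Convert an expression matrix (header row = sample IDs,
    first column = gene IDs) into (sample_id, gene_id, exp_value) rows."""
    reader = csv.reader(io.StringIO(matrix_text), delimiter=delimiter)
    sample_ids = next(reader)[1:]          # skip the corner cell
    for row in reader:
        gene_id, values = row[0], row[1:]
        for sample_id, value in zip(sample_ids, values):
            yield (sample_id, gene_id, value)

# Tiny example matrix (tab-separated); a real 50 GB file would be
# read line by line from disk and the triples written straight to a
# file for LOAD DATA, never held in memory all at once.
matrix = "gene\tS1\tS2\nG1\t0.5\t1.2\nG2\t3.1\t0.0\n"
triples = list(matrix_to_triples(matrix))
```

Each emitted triple maps directly onto one row of the (sample_id, gene_id, exp_value) table.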

Related

Unable to import CSV DATE columns to BigQuery

I am unable to import a CSV table's DATE columns to BigQuery.
The dates are not recognized, even though they have the correct format (YYYY-MM-DD) according to this doc:
https://cloud.google.com/bigquery/docs/schema-detect
So the DATE columns are not recognized and get renamed to _2020-0122, 2020-01-23...
Is the issue that the dates are in the first row, as column names?
But how can I then import the dates when I want to use them in time-series charts (Data Studio)?
Here is a sample of the source CSV:
Province/State,Country/Region,Lat,Long,2020-01-22,2020-01-23,2020-01-24,2020-01-25,2020-01-26
Anhui,China,31.8257,117.2264,1,9,15,39,60
Beijing,China,40.1824,116.4142,14,22,36,41,68
Chongqing,China,30.0572,107.874,6,9,27,57,75
Here is a screenshot from BigQuery (image omitted).
If you have a finite number of day columns, you can unpivot the table when using it. See this blog post.
Otherwise, if you don't know how many day columns the CSV has:
Choose a unique character as the CSV delimiter and load the whole file into a single-column staging table, then use the SPLIT function; you'll also need UNNEST. This approach requires a full scan and will be more expensive, especially as the file gets bigger.
The issue is that column names cannot have a date type, so when the CSV is imported BigQuery takes the dates and rewrites them in the underscore format.
The first way to attack the problem is to modify the CSV file itself, because any import that treats the first row as a header will change the date format, and it is then harder to get back to a date type. If you have any experience with a programming language, the transformation is very easy. I can help with this, but I don't know your use case, so maybe it isn't possible. Where does this CSV come from?
If modifying the CSV beforehand is not possible, the second option is what ktopcuoglu said: import the whole file as one column and process it with SQL functions. This is much harder than the first option, and because all the data lands in a single column it will all share one data type, which will be a headache too.
If you can explain where the CSV comes from, we may be able to influence it before it is ingested by BigQuery. Otherwise you'll need to dig into SQL a bit.
Hope it helps!
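For illustration, if the CSV looks like the sample above, the reshaping described here can be done in a few lines of pandas before loading into BigQuery. A minimal sketch (the output file name and the `confirmed` column name are made up):

```python
import io
import pandas as pd

# A cut-down version of the sample CSV from the question.
csv_text = """Province/State,Country/Region,Lat,Long,2020-01-22,2020-01-23
Anhui,China,31.8257,117.2264,1,9
Beijing,China,40.1824,116.4142,14,22
"""

df = pd.read_csv(io.StringIO(csv_text))
id_cols = ["Province/State", "Country/Region", "Lat", "Long"]

# Unpivot: every date column becomes a row with a proper date value.
long_df = df.melt(id_vars=id_cols, var_name="date", value_name="confirmed")
long_df["date"] = pd.to_datetime(long_df["date"]).dt.date

# long_df now has one row per (province, date); write it back out for BigQuery:
# long_df.to_csv("covid_long.csv", index=False)
```

After this, BigQuery's schema detection sees an ordinary `date` column, which Data Studio can chart as a time series directly.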
Hi, now I can help you further.
First, I found some COVID datasets among the public BigQuery datasets. The one you are taking from GitHub is already in BigQuery, but there are many others that may work better for your task, such as the one called "covid19_ecdc" inside bigquery-public-data. That one has confirmed cases and deaths per date and country, so it should be easy to build a time series.
Second, I found an interesting link doing what you described with Python and Data Studio. It's a Kaggle discussion, so you may not be familiar with it, but it deserves a look for sure. Moreover, it uses the dataset you are trying to use.
Hope it helps. Do not hesitate to ask!

Splitting Large CSV files by Column

I have a very large (4 GB) CSV file that I cannot open in Excel or other editors. It has nearly 3,000 rows and nearly 320,000 columns.
One solution is to split the original file into smaller ones that will open in Excel or another editor.
The second solution is to transpose the original data and then open it in Excel.
I could not find a tool or script for transposing. I've found some scripts and free software for splitting, but each of them splits the CSV by rows.
Is there a way to split the original file into smaller ones of at most 15,000 columns each?
I tried to use:
import pandas as pd
pd.read_csv(file_path, header=None).T.to_csv(new_file_path, header=False)
But it takes ages to complete.
Meanwhile I tried some Python code of my own, but all of it failed because of memory issues.
The trial version of Delimit (http://www.delimitware.com/) handled the data perfectly.
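For reference, a column-wise split can also be done in plain Python, streaming one row at a time so the 4 GB file never has to fit in memory. A sketch (`split_by_columns` is a made-up helper name):

```python
import csv
import itertools

def split_by_columns(in_path, out_prefix, cols_per_file):
    """Split a wide CSV into several narrower CSVs of cols_per_file
    columns each. Rows are streamed one at a time, so memory use stays
    flat regardless of file size."""
    with open(in_path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        n_parts = -(-len(header) // cols_per_file)  # ceiling division
        outs = [open(f"{out_prefix}_{i}.csv", "w", newline="")
                for i in range(n_parts)]
        writers = [csv.writer(o) for o in outs]
        # Write the header slice and every data row slice to each part.
        for row in itertools.chain([header], reader):
            for i, w in enumerate(writers):
                w.writerow(row[i * cols_per_file:(i + 1) * cols_per_file])
        for o in outs:
            o.close()
    return n_parts
```

Each output file keeps all the rows but only a slice of the columns, so every part stays within Excel's column limit.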

Manipulating 100gb Table

I have a dataset in TSV format (tab-separated) that is one big text file of around 100 GB (roughly 255 million rows). I have to filter and extract the relevant rows so I can work with them easily. So far I know that Excel can't handle that many rows, and the familiar text editors either can't open the file or are very painful to use with tables. I've tried LogParser; a 36-minute query gave me a CSV output, but unfortunately the number of exported rows is far below what I believe is in the data. I also get some parsing errors, and some columns in the exported sets are shifted. Do you have any other alternatives? Could I somehow turn the data into an SQL database? Is that possible?
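Turning the file into a database is certainly possible. As one sketch (using SQLite rather than a server database, purely for illustration, with all columns loaded as text), you can stream the TSV in batches and then filter with ordinary SQL:

```python
import csv
import sqlite3

def tsv_to_sqlite(tsv_path, db_path, table="data", batch=50_000):
    """Stream a TSV into SQLite in batches; only one batch of rows is
    held in memory at a time, so a 100 GB file is fine."""
    con = sqlite3.connect(db_path)
    with open(tsv_path, newline="") as f:
        reader = csv.reader(f, delimiter="\t")
        header = next(reader)
        cols = ", ".join(f'"{c}" TEXT' for c in header)
        con.execute(f"CREATE TABLE IF NOT EXISTS {table} ({cols})")
        placeholders = ", ".join("?" * len(header))
        rows = []
        for row in reader:
            rows.append(row)
            if len(rows) >= batch:
                con.executemany(
                    f"INSERT INTO {table} VALUES ({placeholders})", rows)
                rows.clear()
        if rows:  # flush the final partial batch
            con.executemany(
                f"INSERT INTO {table} VALUES ({placeholders})", rows)
    con.commit()
    con.close()
```

Once loaded, `SELECT ... WHERE ...` extracts the relevant rows without any editor ever opening the full file; the same pattern works against MySQL or SQL Server with their respective drivers.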

ETL: how to guess data types for messy CSVs with lots of nulls

I often have to cleanse and import messy CSV and Excel files into my MS SQL Server 2014 (but the question would be the same if I were using Oracle or another database).
I have found a way to do this with Alteryx. Can you help me understand if I can do the same with Pentaho Kettle or SSIS? Alternatively, can you recommend another ETL software which addresses my points below?
I often have tables of, say, 100,000 records where the first 90,000 may be null in a given field. Most ETL tools scan only the first few hundred records to guess data types and therefore fail to guess the types of these fields. Can I force Pentaho or SSIS to scan the WHOLE file before guessing types? I understand this may not be efficient for huge files of many GB, but for the files I handle, scanning the entire file is much better than wasting a lot of time specifying each field manually.
The same applies to string lengths. If the first 10,000 records are, say, 3-character strings but the subsequent ones are longer, SSIS and Pentaho tend to guess nvarchar(3) and the import will fail. Can I force them to scan all rows before guessing string lengths? Or, alternatively, can I easily force all strings to nvarchar(x), where I set x myself?
Alteryx has a multi-field tool, which is particularly convenient when cleansing or converting multiple fields. E.g. I have 10 date columns whose data type was not guessed automatically; I can use the multi-field formula to have Alteryx convert all 10 fields to dates and create new fields called $oldfield_reformatted. Do Pentaho and SSIS have anything similar?
Thank you!
A silly suggestion: in Excel, add a row at the top of the list with a formula that creates a text string the same length as the longest value in the column.
This formula, entered as an array formula, would do it:
=REPT("X",MAX(LEN(A:A)))
You could also use a more advanced VBA function to create other dummy values that force data types in SSIS.
I've not used SSIS or anything like it, but in the past I would load a file into a table whose columns were ALL, say, varchar(1000), so that all the data loaded, then process it across into the main table using SQL that casts or cleans the values as required.
This gives YOU ultimate control, not a package or driver. I was very surprised to hear how this type guessing works!
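The same load-everything-as-strings idea can be mirrored in pandas when prototyping: read every column as a string so nothing is guessed from a sample, then cast whole columns yourself. A minimal sketch (the `code` and `when` column names are made up for illustration):

```python
import io
import pandas as pd

# Two all-null leading rows, like the messy files described above.
csv_text = "code,when\n,\n,\nABC,2021-05-01\nLONGER,2021-05-02\n"

# dtype=str: every column comes in as text, no type guessing at all.
df = pd.read_csv(io.StringIO(csv_text), dtype=str)

# Now cast columns explicitly, scanning ALL values, nulls included.
df["when"] = pd.to_datetime(df["when"], errors="coerce")
max_len = df["code"].str.len().max()   # true max length -> nvarchar(max_len)
```

This is exactly the varchar staging-table trick, just done in memory: the data loads unconditionally, and the conversions are under your control instead of a driver's.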

Random records extracted from a large CSV file

I have 50 CSV files with up to 2 million records in each.
Every day I need to get 10,000 random records from each of the 50 files and build a new CSV file with all of them (10,000 × 50).
I can't do it manually because it would take a lot of time. I've also tried Access, but because the database is larger than 2 GB I cannot use it.
I've tried CSVed too, a good piece of software, but it still didn't help.
Could someone please suggest an idea or tool for pulling random records from the files and writing a new CSV file?
There are many languages you could use; I would use C# and do this:
1) Get the number of lines in the file.
Lines in text file
2) Generate 10,000 random numbers (unique, if you need that) with the maximum being the count from step 1.
Random without duplicates
3) Pull the records chosen in step 2 from the file and write them to a new file.
4) Repeat for each file.
Other options, if you want to consider a database other than Access, are MySQL or SQL Server Express, to name a couple.
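The steps above can be sketched in Python instead of C# (same idea; `sample_lines` is a hypothetical helper name, and the file is assumed to have a header row):

```python
import random

def sample_lines(path, k, seed=None):
    """Steps 1-3: count the data lines, draw k unique random line
    numbers, then pull those lines in a single second pass."""
    rng = random.Random(seed)
    with open(path) as f:
        header = f.readline()
        n = sum(1 for _ in f)                 # step 1: count data lines
    wanted = set(rng.sample(range(n), k))     # step 2: k unique indices
    picked = [header]
    with open(path) as f:
        f.readline()                          # skip the header again
        for i, line in enumerate(f):          # step 3: pull the records
            if i in wanted:
                picked.append(line)
    return picked
```

Step 4 is just a loop over the 50 files, concatenating the results into one output CSV. A single-pass reservoir sample would avoid the counting pass, at the cost of slightly trickier code.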