I'm loading large text files of high school students into MySQL, but the school itself is identified only in the first line of each text file, like so:
897781234Metropolitan High
340098 1001X 678 AS Reading 101KAS DOE KEITH A1 340089 A 7782...
Using SQL, how can I populate the first column of the receiving table with the school number (e.g., 897781234) so that the school is identified on every row?
To load the text files, I'm using:
LOAD DATA INFILE "f:/school_files/school897781234.txt"
INTO TABLE my_table FIELDS TERMINATED BY ''
IGNORE 1 LINES;
Thanks!
Hmmm ... looks like you're doing this under Windows. I prefer Unix/Linux for large text manipulation, but you should be able to use similar techniques under Windows (try installing Cygwin). PowerShell also has some useful capabilities, if you're familiar with that. With that in mind, here are some ideas for you:
Write a script that munges your data files to make them MySQL-friendly: create a new file containing everything after the first line, with the school information prepended to every line, then do your data load from the munged file.
(munge-schools.sh)
#!/bin/bash
# Usage: ./munge-schools.sh input.txt output.munged
ifile=$1
ofile=$2
# Grab the school information from the first line...
school=$(head -1 "${ifile}")
# ...and prepend it to every remaining line.
tail --lines=+2 "${ifile}" | sed "s/^/${school}/" > "${ofile}"

./munge-schools.sh school897781234.txt school897781234.munged
For each school, do the load as is (skipping the first line), but load into a temporary table; then add a column whose default value is the school number, and copy from the temp table into your final table.
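A minimal sketch of that second idea, assuming a staging table tmp_table and a school_id column (the names and the CHAR(9) type are placeholders for your actual schema):

LOAD DATA INFILE "f:/school_files/school897781234.txt"
INTO TABLE tmp_table FIELDS TERMINATED BY ''
IGNORE 1 LINES;

-- In MySQL, a newly added column is filled with its DEFAULT for all existing rows.
ALTER TABLE tmp_table
  ADD COLUMN school_id CHAR(9) NOT NULL DEFAULT '897781234' FIRST;

INSERT INTO my_table SELECT * FROM tmp_table;
TRUNCATE TABLE tmp_table;  -- reset the staging table for the next school's file

This assumes my_table's first column is the school number, matching the staging table's layout after the ALTER.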
Given a choice, I will always go with doing text manipulation outside of the database to make the input files friendlier -- there are lots of text manipulation tools that will be much faster at reformatting your data than your database's bulk load tools.
I have a very large SQL file (14 GB). Currently, I am not able to open this file in my browser or in VS Code because it is so huge that it keeps crashing or takes too long. However, there is a single table that I want out of this huge SQL file.
Is there a way of splitting the SQL file to get just the specific table I am searching for? Any helpful answer, please?
You can do:
Step 1: grep ${YourTableName} -rni path/to/your/file
In the output you'll see the string matching ${YourTableName} and its line number.
Step 2: tail -n +25 path/to/your/file > path/to/your/fileChunk (where 25 must be replaced with the line number from the grep command; tail -n +N prints everything from line N to the end of the file). Now the top of the file path/to/your/fileChunk will have the stuff related to your table.
Step 3 (optional): The top of path/to/your/fileChunk now contains the stuff related to your table, but the middle and bottom of the file may still have stuff related to other tables, so repeat steps 1 & 2 on path/to/your/fileChunk and delete the needless info.
PS: This is only an idea of how to split your huge file into chunks; you have to adapt these commands to your values.
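Putting steps 1 and 2 together, a rough sketch (my_table and dump.sql are placeholders for your table and file):

# Find the first line where the table's dump begins, then slice from there to the end.
line=$(grep -n -m 1 'CREATE TABLE `my_table`' dump.sql | cut -d: -f1)
tail -n "+${line}" dump.sql > my_table_chunk.sql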
For the same data, I first save it as a '.mat' file in Matlab and it takes up 2.31 MB; saved into MySQL, it takes up 9.85 MB.
I want to save it (or them; there are many) into MySQL. How can I make them smaller?
Update:
I use Python to save the data to MySQL with the code below:
cur.execute('create table if not exists test(date date, open float, close float, '
            'high float, low float, vol bigint, turn float)')
cur.executemany('insert into test values (%s,%s,%s,%s,%s,%s,%s)',
                np.column_stack(Matrix).tolist())
In fact, I need many tables like the table "test"; I just want to save the data, and next time I will read the tables back out one by one and do some other things in Python. I want to keep each table as small as possible, so any advice on this?
I've got problems in my work with Neo4j, and if you can help, I will thank you a lot!
My work is something like this: I've got to study and evaluate several graph databases, and to do that I must use a benchmark. The benchmark that I'm using is the Social Network Benchmark (SNB).
I generate files with different contents according to the setup chosen, something similar to this: forum_0.csv
These .csv files have certain headers, like this: id | title | creationDate | etc.
The next step in my project is to load them into Neo4j and build a database to test with certain queries, and my problems start at this point.
I have loaded some files into Neo4j, but others fail with errors and I don't understand why.
I'm using this code to load those files. In this example I load forum_0.csv into Neo4j.
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS
FROM ".../forum_0.csv" AS csvLine
FIELDTERMINATOR "|"
CREATE (:FORUM_0 {id:csvLine.id, title:csvLine.title, creationDate:csvLine.creationDate})
And with this code, the data from this file is loaded into Neo4j correctly.
But with this file - forum_containerOf_post_0.csv - I can't load the data correctly; its header is Forum.id | Post.id.
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS
FROM ".../forum_containerOf_post_0.csv" AS csvLine
FIELDTERMINATOR "|"
CREATE (:FCOP_0 {Forum.id:csvLine.Forum.id, Post.id:csvLine.Post.id})
The problem here is that I can't access the id of forum_0.csv during the load of forum_containerOf_post_0.csv. How can I access that id, or another property? Am I missing some Cypher code?
Is there something wrong in the process? Is there someone here who works with this - SNB and Neo4j?
Is there someone here who can help me with this problem?
I tried to explain my problem, but if you have questions about it, feel free to ask.
Thank you for your time
The problem is with the headers in the second file. If you want to embed periods (.) in the header column names, you need to backtick those columns when you reference them in the LOAD CSV statement.
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS
FROM ".../forum_containerOf_post_0.csv" AS csvLine
FIELDTERMINATOR "|"
CREATE (:FCOP_0 {Forum.id:csvLine.`Forum.id`, Post.id:csvLine.`Post.id`})
Yeah, you got it right in your answer, but with a little correction:
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS
FROM ".../forum_containerOf_post_0.csv" AS csvLine
FIELDTERMINATOR "|"
CREATE (:FCOP_0 {`Forum.id`:csvLine.Forum.id, `Post.id`:csvLine.Post.id})
But I discovered another problem. This creates the FCOP_0 node label, but without the properties that forum_containerOf_post_0.csv has. The two properties are Forum.id and Post.id, but with this process the properties are not loaded into the respective nodes: it creates the FCOP_0 node label in Neo4j, but its nodes don't have those two properties.
Can you please help me?
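For what it's worth, each of the two snippets above fixes only one side of the problem. In csvLine.Forum.id, Cypher reads Forum as a key of the csvLine map; that key doesn't exist, so the value comes back null, and null properties are simply not stored - which would explain the FCOP_0 nodes with no properties. A version that backticks the dotted names on both sides, as a sketch:

USING PERIODIC COMMIT
LOAD CSV WITH HEADERS
FROM ".../forum_containerOf_post_0.csv" AS csvLine
FIELDTERMINATOR "|"
CREATE (:FCOP_0 {`Forum.id`: csvLine.`Forum.id`, `Post.id`: csvLine.`Post.id`})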
I have a database built from CSV files loaded from an external source. For some reason, an ID number in many of the tables is loaded into the CSV / database encased in single quotes - here's a sample line:
"'010010'","MARSHALL MEDICAL CENTER NORTH","8000 ALABAMA HIGHWAY 69","","","GUNTERSVILLE","AL","35976","MARSHALL","2565718000","Acute Care Hospitals","Government - Hospital District or Authority","Yes"
Is there any SQL I can run on the already-established database to strip these single quotes, or do I have to parse every CSV file and re-import?
I believe the following would do it (test it first):
UPDATE U
SET YourID = REPLACE(YourID, '''', '')
FROM MyTable AS U
WHERE YourID LIKE '''%'''
If it works right, do a full backup before running it in production.
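Note that the UPDATE ... FROM alias syntax above is SQL Server style. If this database happens to be MySQL, the same idea without the alias might look like this (MyTable and YourID are placeholders for your table and column):

UPDATE MyTable
SET YourID = REPLACE(YourID, '''', '')
WHERE YourID LIKE '''%''';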
We are building a big web application and we use MySQL; we want to make the MySQL database faster.
Some of us think that if we put the message HTML body inside a table, rather than inside a text file (text.txt), it will make the database heavy and slow.
Thanks,
Part of the main table that holds the message:
option 1: hold the HTML message body inside the database
message {
id (int)
subject (varchar)
body (text)
}
option 2: hold the HTML message body inside a body1.txt file
message {
id (int)
subject (varchar)
file_body_path (varchar)
}
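In MySQL terms, the two options might look something like this (a rough sketch; the column sizes are placeholders, and you would create only one of the two):

-- option 1: HTML body stored in the table
CREATE TABLE message (
  id INT PRIMARY KEY,
  subject VARCHAR(255),
  body TEXT
);

-- option 2: body stored on disk, the table keeps only the path
CREATE TABLE message (
  id INT PRIMARY KEY,
  subject VARCHAR(255),
  file_body_path VARCHAR(255)
);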
If you:
Don't need transactional control over the contents of your files and
Only treat the files as an atomic entity (i.e., don't parse them, search their contents, etc.),
then you are better off storing them out of the database.
The HTTP server will serve the disk-based files much faster to begin with.
As Quassnoi correctly points out, the webserver will most likely be faster serving txt files than data from the DB ...
BUT: This only works if the webserver doesn't have to run any searches/queries against the DB to build the links between the TXT files.
Think of these use cases:
remove a text file
add a text file
add a link to a text file
remove a link from a text file
find a text passage within a text file.
Each of these use cases will require you to parse the TXT files and maintain all the needed links in the 'index pages'. How will you do this in your content management system?
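To make the last use case concrete: if the body lives in the table (option 1), a text search can stay entirely inside MySQL, for example with a FULLTEXT index (a sketch, assuming the message table from option 1; FULLTEXT on InnoDB requires MySQL 5.6+):

-- Index the body once, then search it with MATCH ... AGAINST.
ALTER TABLE message ADD FULLTEXT INDEX ft_body (body);

SELECT id, subject
FROM message
WHERE MATCH(body) AGAINST ('some search phrase');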