Importing data from geonames.org database into MySQL DB - mysql

Does anyone know how to import geonames.org data into my database? The one I'm trying to import is http://download.geonames.org/export/dump/DO.zip, and my DB is a MySQL db.

I found the following by looking in the readme file included in the zip file you linked to, in the section called "The main 'GeoName' table has the following fields:".
First create the database and table on your MySQL instance. The field types are given in each row of the section whose title I just quoted.
CREATE DATABASE DO_test;
CREATE TABLE `DO_test`.`DO_table` (
`geonameid` INT,
`name` varchar(200),
`asciiname` varchar(200),
`alternatenames` varchar(5000),
`latitude` DECIMAL(10,7),
`longitude` DECIMAL(10,7),
`feature class` char(1),
`feature code` varchar(10),
`country code` char(2),
`cc2` char(60),
`admin1 code` varchar(20),
`admin2 code` varchar(80),
`admin3 code` varchar(20),
`admin4 code` varchar(20),
`population` bigint,
`elevation` INT,
`gtopo30` INT,
`timezone` varchar(100),
`modification date` date
)
CHARACTER SET utf8;
After the table is created you can import the data from the file. The fields are delimited by tabs and the rows by newlines, which matches LOAD DATA's defaults:
LOAD DATA INFILE '/path/to/your/file/DO.txt' INTO TABLE `DO_test`.`DO_table`;
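The statement above relies on LOAD DATA's defaults; the same import can be written with the delimiters and character set spelled out explicitly (the path is a placeholder, as above):

```sql
-- Same import with the delimiters LOAD DATA assumes by default made explicit.
-- '/path/to/your/file/DO.txt' is a placeholder path.
LOAD DATA INFILE '/path/to/your/file/DO.txt'
INTO TABLE `DO_test`.`DO_table`
CHARACTER SET utf8
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n';
```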

I recently made a shell script that downloads the latest data from the geonames site and imports it into a MySQL database. It is based on the knowledge in the GeoNames Forum and saved me a lot of time.
It is in its first version but is fully functional. Maybe it can help.
You can access it at http://codigofuerte.github.com/GeoNames-MySQL-DataImport/

For everyone in the future:
On the geonames.org forum in 2008: "import all geonames dump into MySQL"
http://forum.geonames.org/gforum/posts/list/732.page
Also google this: import dump into [postgresql OR SQL server OR MySQL] site:forum.geonames.org
to find more answers, some going back to 2006.
Edited to provide a synopsis:
The official GeoNames readme at http://download.geonames.org/export/dump/ gives a good description of the dump files and their contents.
Dump files can be imported into MySQL tables directly, for example:
SET character_set_database=utf8;
LOAD DATA INFILE '/home/data/countryInfo.txt' INTO TABLE _geo_countries IGNORE 51 LINES(ISO2,ISO3,ISO_Numeric,FIPSCode,AsciiName,Capital,Area_SqKm,Population,ContinentCode,TLD,CurrencyCode,CurrencyName,PhoneCodes,PostalCodeFormats,PostalCodeRegex,Languages,GeonameID,Neighbours,EquivalentFIPSCodes);
SET character_set_database=default;
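The LOAD DATA above assumes a _geo_countries table already exists. A hypothetical DDL for it, with column names taken from the column list above and types guessed from the countryInfo.txt header described in the readme:

```sql
-- Hypothetical DDL; column names follow the LOAD DATA column list above,
-- types are a reasonable guess from the countryInfo.txt readme description.
CREATE TABLE _geo_countries (
  ISO2 CHAR(2),
  ISO3 CHAR(3),
  ISO_Numeric SMALLINT,
  FIPSCode VARCHAR(3),
  AsciiName VARCHAR(200),
  Capital VARCHAR(200),
  Area_SqKm DOUBLE,
  Population BIGINT,
  ContinentCode CHAR(2),
  TLD VARCHAR(10),
  CurrencyCode CHAR(3),
  CurrencyName VARCHAR(40),
  PhoneCodes VARCHAR(40),
  PostalCodeFormats VARCHAR(100),
  PostalCodeRegex VARCHAR(255),
  Languages VARCHAR(200),
  GeonameID INT,
  Neighbours VARCHAR(100),
  EquivalentFIPSCodes VARCHAR(10)
) CHARACTER SET utf8;
```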
Be careful about the character set: using the "CSV using LOAD DATA" importer of an old (circa 2012) phpMyAdmin, you may lose UTF-8 characters even if the column collation is set to utf8_general_ci.
Currently there are 4 essential tables: continents, countries (countryInfo.txt), divisions (admin1), and cities or locations (geonames).
The admin1, 2, 3, 4 dump files are the different levels of internal divisions of countries: admin1 corresponds to the states of the US or the provinces of other countries; admin2 is more detailed and holds the internal divisions of a state or province; and so on for admin3 and admin4.
The per-country dump files listed there contain not only cities but all the locations in that country, down to individual shopping centers. There is also a huge file, allCountries.txt, which is more than 1 GB after extracting from the zip. If you want only the cities, choose one of the dump files cities1000.txt, cities5000.txt, or cities15000.txt, where the number is the minimum population of the listed cities. Cities are stored in the geonames table (you may call it geo locations or geo cities).
Before importing the *.txt dump files, do a bit of research on the LOAD DATA syntax in the MySQL documentation.
The readme text file (also shown in the footer of the dump page) provides enough description, for example:
The main 'geoname' table has the following fields :
---------------------------------------------------
geonameid : integer id of record in geonames database
name : name of geographical point (utf8) varchar(200)
asciiname : name of geographical point in plain ascii characters, varchar(200)
alternatenames : alternatenames, comma separated varchar(5000)
latitude : latitude in decimal degrees (wgs84)
longitude : longitude in decimal degrees (wgs84)
feature class : see http://www.geonames.org/export/codes.html, char(1)
feature code : see http://www.geonames.org/export/codes.html, varchar(10)
country code : ISO-3166 2-letter country code, 2 characters
cc2 : alternate country codes, comma separated, ISO-3166 2-letter country code, 60 characters
admin1 code : fipscode (subject to change to iso code), see exceptions below, see file admin1Codes.txt for display names of this code; varchar(20)
admin2 code : code for the second administrative division, a county in the US, see file admin2Codes.txt; varchar(80)
admin3 code : code for third level administrative division, varchar(20)
admin4 code : code for fourth level administrative division, varchar(20)
population : bigint (8 byte int)
elevation : in meters, integer
dem : digital elevation model, srtm3 or gtopo30, average elevation of 3''x3'' (ca 90mx90m) or 30''x30'' (ca 900mx900m) area in meters, integer. srtm processed by cgiar/ciat.
timezone : the timezone id (see file timeZone.txt) varchar(40)
modification date : date of last modification in yyyy-MM-dd format
Also, regarding the varchar(5000): be aware of the ~64 KB row size limit in MySQL 5.0 and later:
Is a VARCHAR(20000) valid in MySQL?
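If that ~64 KB row limit becomes a problem, one common workaround is to store the widest column as TEXT instead of VARCHAR, since a TEXT column contributes only a small pointer to the row size. A sketch, assuming a geonames table with an alternatenames column as described above:

```sql
-- Hypothetical: move the widest column out of the row-size budget.
-- TEXT is stored off-row, so it no longer counts ~5000*3 bytes
-- toward the 65,535-byte row limit.
ALTER TABLE geonames
  MODIFY alternatenames TEXT CHARACTER SET utf8;
```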

This is my note after I imported successfully.
As of writing I was testing with MySQL 5.7.16 on Windows 7. Follow these steps to import:
Download the desired data file from the official download page. In my case I chose cities1000.zip because it's much smaller (21 MB) than the all-inclusive allCountries.zip (1.4 GB).
Create the following schema and table according to readme.txt on the download page, where the fields are specified below the text "the main 'geoname' table has the following fields".
CREATE SCHEMA geonames DEFAULT CHARSET utf8 COLLATE utf8_general_ci;
CREATE TABLE geonames.cities1000 (
id INT,
name VARCHAR(200),
ascii_name VARCHAR(200),
alternate_names VARCHAR(10000) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci,
latitude DECIMAL(10, 7),
longitude DECIMAL(10, 7),
feature_class CHAR(1),
feature_code VARCHAR(10),
country_code CHAR(2),
cc2 CHAR(60),
admin1_code VARCHAR(20),
admin2_code VARCHAR(80),
admin3_code VARCHAR(20),
admin4_code VARCHAR(20),
population BIGINT,
elevation INT,
dem INT,
timezone VARCHAR(100),
modification_date DATE
)
CHARACTER SET utf8;
Field names are arbitrary as long as the column sizes and field types match the specification. alternate_names is specifically defined with the character set utf8mb4 because the values for this column in the file contain 4-byte Unicode characters, which are not supported by MySQL's utf8 character set.
Check the values of these parameters: character_set_client, character_set_results, character_set_connection [7]:
SHOW VARIABLES LIKE '%char%';
If they are not utf8mb4, then change them:
SET character_set_client = utf8mb4;
SET character_set_results = utf8mb4;
SET character_set_connection = utf8mb4;
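The three session variables above can also be set in one statement; SET NAMES is the documented shorthand for exactly these three:

```sql
-- Equivalent to setting character_set_client, character_set_results,
-- and character_set_connection individually:
SET NAMES utf8mb4;
```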
Import data from file using LOAD DATA INFILE ...
USE geonames;
LOAD DATA INFILE 'C:\\ProgramData\\MySQL\\MySQL Server 5.7\\Uploads\\cities1000.txt' INTO TABLE cities1000
CHARACTER SET utf8mb4 (id, name, ascii_name, alternate_names, latitude, longitude, feature_class, feature_code,
country_code, cc2, admin1_code, admin2_code, admin3_code, admin4_code, population, @val1,
@val2, timezone, modification_date)
SET elevation = IF(@val1 = '', NULL, @val1), dem = IF(@val2 = '', NULL, @val2);
Explanation for the statement:
The file should be placed in the location designated by MySQL for importing data from files. You can check the location with SHOW VARIABLES LIKE 'secure_file_priv';. In my case it's C:\ProgramData\MySQL\MySQL Server 5.7\Uploads. On Windows you need to use double backslashes to represent one backslash in the path. This error is shown when the path is not given correctly: [HY000][1290] The MySQL server is running with the --secure-file-priv option so it cannot execute this statement.
With CHARACTER SET utf8mb4 you're telling MySQL what encoding to expect from the file. When this is not given explicitly, or the column encoding is not utf8mb4, an error like this is shown: [HY000][1300] Invalid utf8 character string: 'Gorad Safija,SOF,Serdica,Sofi,Sofia,Sofiae,Sofie,Sofii,Sofij,Sof' [5]. In my case I found it was due to Gothic letters in the alternate names of the records with ids 727011, 3464975, and 3893894. These letters need to be stored as 4-byte characters (utf8mb4), while my encoding at the time was utf8, which only supports characters up to 3 bytes [6]. You can change the column encoding after the table is created:
ALTER TABLE cities1000 MODIFY alternate_names VARCHAR(10000) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
To check the encoding of a column:
SELECT character_set_name, COLLATION_NAME FROM information_schema.COLUMNS WHERE table_schema = 'geonames' AND table_name = 'cities1000' AND column_name = 'alternate_names';
To test that such characters can be stored (𐍈 is U+10348 GOTHIC LETTER HWAIR, a 4-byte character in UTF-8):
UPDATE cities1000 SET alternate_names = '𐍈' WHERE id = 1;
Values for some columns need to be adjusted before they are inserted, such as elevation and dem. They are of type INT, and their values in the file can be empty strings, which can't be stored in an INT column. So you need to convert those empty strings to NULL, which is what the latter part of the statement does. This error is shown when the values are not properly converted first: [HY000][1366] Incorrect integer value: '' for column 'elevation' at row 1 [3, 4].
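As a side note, that empty-string-to-NULL conversion can also be written with NULLIF, which returns NULL when its two arguments are equal (MySQL user variables are written with a leading @):

```sql
-- Equivalent to SET elevation = IF(@val1 = '', NULL, @val1), ...
SET elevation = NULLIF(@val1, ''), dem = NULLIF(@val2, '');
```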
References
http://www.geonames.org/
http://download.geonames.org/export/dump/
https://dev.mysql.com/doc/refman/8.0/en/load-data.html
https://dba.stackexchange.com/a/111044/94778
https://dev.mysql.com/doc/refman/5.5/en/charset-unicode-conversion.html
https://dev.mysql.com/doc/refman/5.5/en/charset-unicode-utf8mb4.html
https://stackoverflow.com/a/35156926/4357087
https://dev.mysql.com/doc/refman/5.7/en/charset-connection.html

Related

Loading a huge CSV file with high performance in Oracle table

I have a CSV file which is about 20 GB in size. The file has three Persian columns. I want to load it into an Oracle table. I searched and found that SQL*Loader has high performance. But when I load the file into the table, the Persian data is not loaded in the right order; this is because Persian is a right-to-left language.
I use this control file:
OPTIONS (SKIP=0, ERRORS=500, PARALLEL=TRUE, MULTITHREADING=TRUE, DIRECT=TRUE,
SILENT=(ALL))
load data
CHARACTERSET UTF8
infile '/home/oracle/part.csv'
APPEND
into table Fact_test
fields terminated by ','
trailing nullcols(
A_ID INTEGER,
T_ID,
G_ID,
TRTYPE,
ORRETURNED,
MECH,
AMN,
TRAM INTEGER,
USERID INTEGER,
USERS INTEGER,
VERID INTEGER,
TRSTAMP CHAR(4000),
OPR_BRID INTEGER
)
File is like this:
A_ID,T_ID,g_id,TrType,ORRETURNED,Mech,Amn,Tram,UserID,UserS,VerID,TRSTAMP,OPR_BRID
276876739075,154709010853,4302,亘乇賵賮賯,丕氐賱蹖,睾蹖乇 爻亘讴,亘乇乇爻蹖,86617.1,999995,NULL,NULL,1981-11-16 13:23:16,2516
When I export the table in Excel format, I receive this; some numbers have become negative:
(A_ID,T_ID,g_id,TrType,ORRETURNED,Mech,Amn,Tram,UserID,UserS,VerID,TRSTAMP,OPR_BRID) values (276876739075,'154709010853',411662402610,'4302','睾蹖乇 亘乇乇爻蹖','丕氐賱賷','爻亘讴',-1344755500,-1445296167,-1311201320,909129772,'77.67',960051513);
The problem is that when the data is loaded, some columns contain negative numbers and the order of some columns changes.
Would you please guide me on how to solve the issue?
Any help is really appreciated.
Problem solved:
I changed the control file to this one:
load data
CHARACTERSET UTF8
infile '/home/oracle/test_none.csv'
APPEND
into table Fact_test
FIELDS terminated BY ','
trailing nullcols(
A_ID CHAR(50),
T_ID CHAR(50),
G_ID CHAR(50),
TRTYPE,
ORRETURNED,
MECH,
AMN,
TRAM CHAR(50),
USERID,
USERS CHAR(50),
VERID CHAR(50),
TRSTAMP,
OPR_BRID CHAR(50)
)

inner join two datasets but return nothing without any error (date format issue)?

I'm new to SQL. Currently I'm doing a task about joining two datasets; one of the datasets was created by myself. Here's the query I used:
USE `abcde`;
CREATE TABLE `test_01`(
`ID` varchar(50) CHARACTER SET latin1 COLLATE latin1_bin DEFAULT NULL,
`NUMBER01` bigint(20) NOT NULL DEFAULT '0',
`NUMBER02` bigint(20) NOT NULL,
`date01` date DEFAULT NULL,
PRIMARY KEY (`ID`, `date01`))
Then I load the data from a csv file into this table. The csv file looks like this:
ID NUMBER01 NUMBER02 DATE01
aaa=ee 12345678 235896578 **2009-01-01T00:00:00**
If I query this newly created table, it looks like this (the format of the 'DATE01' column changes):
ID NUMBER01 NUMBER02 DATE01
aaa=ee 12345678 235896578 **2009-01-01**
Another dataset I queried and exported to a csv file; its date01 column looks like 01/12/1979 in the csv and like 1979-12-01 in SQL.
I also used select * from information_schema.columns to check the datatype of the columns I need to join. For the newly created dataset:
The date column for the other dataset is:
The differences are:
1. The format of the date column in the csv appears different
2. The COLUMN_DEFAULT values are different: one is 0000-00-00, the other NULL.
I suspect the reason I got empty output is this difference in the 'date' format, but I'm not sure how to make them the same so that the join returns something. Can someone give me a hint? Thank you.
the format of the 'DATE01' changes
Of course; the DATE datatype does not contain a time component or timezone info.
I wonder the reason why I got empty output is probably because the difference in the 'date' format
If an input value has a problem (like a wrong date format), the corresponding value is truncated or set to NULL. You should have received a bunch of warnings during the import similar to "truncated incorrect value".
If the date field in the CSV has the wrong format, you must use an intermediate user-defined variable to accept the raw value, then apply a proper converting expression to it in the SET clause. Like:
LOAD DATA INFILE ...
INTO TABLE tablename (field1, ..., @date01)
SET date01 = STR_TO_DATE(@date01, '%d/%m/%Y');
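The format string can be sanity-checked directly before running the import:

```sql
-- '%d/%m/%Y' parses day/month/4-digit-year, matching values like 01/12/1979:
SELECT STR_TO_DATE('01/12/1979', '%d/%m/%Y');
-- returns 1979-12-01
```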

How to store a .docx file into MySQL - and open it?

MySQL
CREATE TABLE document_control (
id INT NOT NULL PRIMARY KEY AUTO_INCREMENT,
person VARCHAR(40),
dateSent TIMESTAMP,
fileAttachment MEDIUMBLOB
);
MySQL Insert record query
INSERT INTO document_control (fileattachment) VALUES (LOAD_FILE('C:\\Users\\<user>\\Desktop\\test.docx'));
Retrieving record
If I run this query: SELECT * FROM document_control, everything is null, even after the insert query above.
Question
Why are the values null? And how can I properly store a .docx file into MySQL and open the file?
You need to look into the SQL BLOB data types.
You could also read the file as bytes, convert it to a string (base64 encoding or similar), and then save that string in the database.
You could also choose to save only a file reference (the file's path) and open the file from disk.
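On the NULL question specifically: LOAD_FILE() silently returns NULL when its preconditions are not met, which is the usual cause here. The file must be on the MySQL server host, be readable by the server, be smaller than max_allowed_packet, the connected user needs the FILE privilege, and if secure_file_priv is set the file must live inside that directory. Also, backslashes in a MySQL string literal must be doubled. A sketch to diagnose (the path placeholder is kept from the question):

```sql
-- Check where the server allows reading files from, and the size limit:
SHOW VARIABLES LIKE 'secure_file_priv';
SHOW VARIABLES LIKE 'max_allowed_packet';

-- Use doubled backslashes (or forward slashes) in the path:
INSERT INTO document_control (fileAttachment)
VALUES (LOAD_FILE('C:\\Users\\<user>\\Desktop\\test.docx'));

-- If the insert still stores NULL, test LOAD_FILE on its own:
SELECT LOAD_FILE('C:\\Users\\<user>\\Desktop\\test.docx') IS NULL;
```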

MySQL MD5 LONGTEXT with Binary Data

Is there a way to get the md5 hash of a binary value that's stored in a MySQL LONGTEXT field?
Example:
CREATE TABLE table_name (
data_path VARCHAR(255) NOT NULL,
data_column LONGTEXT CHARACTER SET utf8 COLLATE utf8_unicode_ci,
PRIMARY KEY (data_path)
)
Then in application code (example PHP):
function insert($path) {
// ...
$raw_file_data = file_get_contents($path);
$stmt = $dbh->prepare(
"INSERT INTO REGISTRY (data_path, data_column) VALUES (:path, :data)"
);
$stmt->bindParam('path', $path);
$stmt->bindParam('data', $raw_file_data);
$stmt->execute();
// ...
}
insert('/path/to/binary_file.jpg');
insert('/path/to/text_file.text');
Later, we query the md5 hashes of the inserted rows:
SELECT md5(data_column) FROM table_name WHERE data_path = '/path/to/binary_file.jpg';
SELECT md5(data_column) FROM table_name WHERE data_path = '/path/to/text_file.txt';
However, the hashes from MySQL do not match the md5sum of the actual files for any non-plaintext files.
md5sum /path/to/binary_file.jpg # does not match!
md5sum /path/to/text_file.txt # matches!
As far as I understand, this has to do with the way MySQL encodes the data for the column's character set.
I also understand this should be a binary column (BLOB, LONGBLOB, etc.), but this is a legacy system which uses the same table to store binary and text files and depends on being able to search those text files.
My question is: is there a way to get the md5 hash of the binary value of what is stored in data_column?
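For what it's worth, one approach sometimes suggested for this situation is to cast the column to binary before hashing, so MD5 sees raw bytes rather than utf8-interpreted text. Note that this can only match md5sum if the original bytes survived the insert into a utf8 column in the first place, which is not guaranteed for arbitrary binary data:

```sql
-- Hash the raw bytes of the stored value rather than its utf8 text form.
-- This may still differ from md5sum if bytes were altered at insert time.
SELECT MD5(CONVERT(data_column USING binary))
FROM table_name
WHERE data_path = '/path/to/binary_file.jpg';
```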

SQL - how to return exact matches only (special characters)

I have a table with words in Spanish (INT id_word, VARCHAR(255) word). Let's suppose the table has these records:
1 casa
2 pantalon
If I search for the word pantalón (with the special char ó) it should not return any rows. How do I select exact matches only? It is currently returning the 2nd row.
SELECT * FROM words WHERE word='pantalón';
Thanks!
Solution from ifx: I changed the word field's collation to utf8_bin.
The reason this happens comes down to the collation. There are collations that are accent-sensitive (which you want in this case) and others that are accent-insensitive (which is what you currently have configured). There are also case-sensitive and case-insensitive collations.
The following code produces the correct result:
create table test (
id int identity(1,1),
value nvarchar(100) collate SQL_Latin1_General_Cp437_CI_AS
)
insert into test values ('casa')
insert into test values ('pantalon')
select value collate SQL_Latin1_General_Cp437_CS_AS from test where value = 'pantalón'
The below code produces the incorrect result:
drop table test
go
create table test (
id int identity(1,1),
value nvarchar(100) collate SQL_Latin1_General_Cp437_CI_AI
)
insert into test values ('casa')
insert into test values ('pantalon')
select value collate SQL_Latin1_General_Cp437_CS_AS from test where value = 'pantalón'
The key here is the collation: AI means accent-insensitive, AS means accent-sensitive.
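The examples above are SQL Server syntax; since the question is about MySQL, the equivalent fix there is a binary (accent- and case-sensitive) collation, either on the column or per query. A sketch assuming the words table from the question:

```sql
-- Make the column itself accent- and case-sensitive
-- (this is the utf8_bin fix mentioned in the accepted solution):
ALTER TABLE words
  MODIFY word VARCHAR(255) CHARACTER SET utf8 COLLATE utf8_bin;

-- Or force the collation for a single comparison, leaving the column as-is:
SELECT * FROM words WHERE word = 'pantalón' COLLATE utf8_bin;
```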
I have this problem in our language too, so I did this: I have two columns for names, one named SearchColumn and the other ViewColumn. When saving data I replace special characters with other characters. When a user searches for something, I apply the same replacement function to the search term and look it up in SearchColumn; if it matches, I display the value of ViewColumn.
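The two-column approach described above can be sketched like this (table and column names are illustrative; the accent-stripping itself happens in application code):

```sql
-- ViewColumn holds the original spelling; SearchColumn holds the
-- normalized form written by the application at save time.
CREATE TABLE names (
  id INT PRIMARY KEY,
  ViewColumn VARCHAR(255),
  SearchColumn VARCHAR(255)
);

-- Save: the application stores both forms, e.g. 'pantalón' / 'pantalon'.
INSERT INTO names VALUES (1, 'pantalón', 'pantalon');

-- Search: normalize the user's input with the same function, match only
-- against SearchColumn, and display ViewColumn.
SELECT ViewColumn FROM names WHERE SearchColumn = 'pantalon';
```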