I have multiple CSV files with details of people. I copy them into HDFS using the -copyFromLocal command and view them through a Hive table. My new use case is that these CSV files on my local machine get updated daily, and I want that data to be updated in HDFS as well, similar to the way Sqoop incremental import copies data from an RDBMS to HDFS. Is there a way to do this, and if so, how?
Assuming every file contains the same fields, create a single top-level HDFS directory and add a date partition for every day:
/daily_import
    /day=20180704
        /file.csv
    /day=20180705
        /file.csv
Then define a table over it
CREATE EXTERNAL TABLE daily_csv (
  ...
)
PARTITIONED BY (`day` STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE -- Use the CsvSerde instead if fields can contain quoted commas!
LOCATION '/daily_import';
Then every day after you copy files into the appropriate HDFS location, execute a metastore refresh for new partitions
MSCK REPAIR TABLE daily_csv;
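A minimal sketch of that daily step as a shell script, assuming the local file lands at /data/people/file.csv (a hypothetical path) and the layout and table above:
#!/bin/bash
# daily_copy.sh -- copy today's CSV into a new date partition, then refresh the metastore
set -euo pipefail
DAY=$(date +%Y%m%d)
hdfs dfs -mkdir -p /daily_import/day=${DAY}
hdfs dfs -copyFromLocal -f /data/people/file.csv /daily_import/day=${DAY}/
hive -e "MSCK REPAIR TABLE daily_csv;"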
I have created 4 tables (a, b, c, d) in Hive and created a view (x) on top of those tables by joining them.
-- How can I export x's underlying CSV data from HDFS to local?
-- How can I keep this CSV in HDFS?
For tables, we can do show create table a;
which shows the HDFS location where the underlying CSV is stored, and then:
hadoop fs -get source_path_and_file dest_path_and_file
Similarly, how can I get the CSV data from the view onto my local machine?
You can export the view data to CSV using this:
insert overwrite local directory '/user/home/dir' row format delimited fields terminated by ',' select * from view;
If you need a single file, concatenate the files in the local directory using cat:
cat /user/home/dir/* > view.csv
Alternatively, if the dataset is small, you can add ORDER BY to the query; this forces a single reducer and produces a single ordered file. This will be slow if the dataset is big.
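A minimal sketch of that alternative, reusing the directory and view name from the query above and ordering by a hypothetical id column:
hive -e "insert overwrite local directory '/user/home/dir'
row format delimited fields terminated by ','
select * from view order by id"
cat /user/home/dir/* > view.csv   # now a single, ordered file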
1) To write your results to a file, you can use INSERT OVERWRITE as below:
insert overwrite local directory '/tmp/output'
row format delimited
fields terminated by '|'
select * from <view>;
2) If you want to write the file into HDFS, use the same INSERT OVERWRITE statement as above, but without the local keyword.
3) There is no separate HDFS location for views.
A view is a purely logical construct over its base tables, and there is no separate underlying storage created for it in HDFS.
Views are used when you want to name an intermediate result and query it directly instead of writing the complex query against the tables again and again - much like WITH blocks in a query.
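For point 2), a minimal sketch of the HDFS variant; the target directory is hypothetical, and the only change is dropping the local keyword:
hive -e "insert overwrite directory '/tmp/output_hdfs'
row format delimited
fields terminated by '|'
select * from <view>"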
I have two CSV files that I uploaded to Azure Blob Storage within HDInsight. I can upload these two files to the cluster without problems. I then create two Hive tables with...
CREATE EXTERNAL TABLE IF NOT EXISTS hive_table1(id int, age string, date string...)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\;' STORED AS TEXTFILE LOCATION '/user/hive/warehouse'
Similar syntax goes for the other table.
Now I want to load the first CSV-file into the first table and the second CSV-file into the second table (resulting in non-corresponding columns).
I use...
LOAD DATA INPATH '/file/file1.csv' OVERWRITE INTO TABLE hive_table1;
...and am able to load the CSV file's data into the first table. But not only is the first data set loaded into the first Hive table; the exact same file's data also ends up in the second Hive table.
Obviously, I only want the first data set loaded into one table and the second, distinct data set only into the other table.
Can anyone point out the error or suggest a possible solution?
Thanks in advance.
It looks like you just need to specify a different 'LOCATION' for the second table. When you do the 'LOAD DATA', Hive is actually copying data into that path. If both tables have the same 'LOCATION', they will share the same data.
Your location is what is creating the problem: you have given the same location for both tables. As the tables are external, the files are created directly under that path.
Also, LOAD DATA INPATH '/file/file1.csv' OVERWRITE INTO TABLE hive_table1; overwrites any file already in that location. This is what is happening with your tables. As Farooque mentioned, the location should be unique for each table to get the desired results.
I see you are creating external tables, with 2 tables holding a single file each.
You just have to follow the simple steps below:
Create table
CREATE EXTERNAL TABLE IF NOT EXISTS hive_table1(id int, age string, date string...)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ';' STORED AS TEXTFILE LOCATION '/user/hive/warehouse/table1_dir/'
Copy file to HDFS location
hdfs dfs -put '/file/file1.csv' '/user/hive/warehouse/table1_dir/'
Similarly, for the second table:
Create table
CREATE EXTERNAL TABLE IF NOT EXISTS hive_table2(id int, age string, date string...)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ';' STORED AS TEXTFILE LOCATION '/user/hive/warehouse/table2_dir/'
Copy file to HDFS location
hdfs dfs -put '/file/file2.csv' '/user/hive/warehouse/table2_dir/'
Note: If you are using more than one table, each table's location should be unique.
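Because the tables are external and the files already sit in each table's own location, no LOAD DATA statement is needed; a quick sanity check, assuming the hive CLI is available on the cluster:
hive -e "select count(*) from hive_table1; select count(*) from hive_table2;"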
I am really quite the amateur. I am trying to automate the import of CSV data into a table that resides in Hadoop. The CSV file resides on a server. From my googling, it seems that I would have to write a shell script to upload the CSV file into HDFS and then write a Hive script to import the CSV into the table. All the scripts could then be put into an Oozie workflow to automate this. Is this right? Is there a better way? Could someone point me in the right direction?
To put a file into HDFS:
hadoop fs -put /here/the/local/file.csv /here/the/destination/in/HDFS
To create a Hive table based on a CSV:
CREATE TABLE yourTable(Field1 INT, Field2 STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY 'yourSeparator';
And once you have created your table:
LOAD DATA INPATH '/HDFS/Path/To/YourFile.csv' INTO TABLE yourTable;
And yes, you can do it with an Oozie workflow, or in Java, for example...
The way I have been doing it is with an SQL file and a cron job. The SQL consists of loading the data into a table, then doing whatever other operations are needed on it.
The file contains the same SQL you would enter into the Hive CLI. You run it from the command line (or as a cron job) with hive -f <file>.
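A minimal sketch of that approach, reusing the paths from the answer above; the script, log, and .hql file names are hypothetical, and load_csv.hql would contain the LOAD DATA INPATH statement shown earlier:
#!/bin/bash
# import_csv.sh -- upload the CSV to HDFS, then run the Hive script
set -euo pipefail
hadoop fs -put -f /here/the/local/file.csv /here/the/destination/in/HDFS
hive -f /scripts/load_csv.hql

# example crontab entry to run it daily at 01:00
# 0 1 * * * /scripts/import_csv.sh >> /tmp/import_csv.log 2>&1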
Hope that helps.
I have tables that are on different MySQL instances. I want to export some data as CSV from one instance and perform a left join between a table on the other instance and the exported CSV data. How can I achieve this?
Quite surprisingly, this is possible with MySQL, but there are several steps you need to go through.
1) Create a template table using the CSV engine and the desired table layout. This is the table into which you will import your CSV file. For example: CREATE TABLE yourcsvtable (field1 INT NOT NULL, field2 INT NOT NULL) ENGINE=CSV. Please note that NULL values are not supported by the CSV engine.
2) Perform your SELECT to extract the CSV file, e.g. SELECT * FROM anothertable INTO OUTFILE 'temp.csv' FIELDS TERMINATED BY ',';
3) Copy temp.csv into your target MySQL data directory as yourcsvtable.CSV. The location and exact name of this file depend on your MySQL setup. You cannot perform the SELECT in step 2 directly into this file, as it is already open - you need to handle this in your script.
4) Use FLUSH TABLE yourcsvtable; to reload/import the CSV table.
5) Now you can execute your query against the CSV file as expected.
Depending on your data, you need to ensure that the values are correctly enclosed by quotation marks or escaped - this needs to be taken into account in step 2.
The CSV file can be created by MySQL on another server or by some other application, as long as it is well-formed.
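A condensed sketch of the steps above as a script; host names, credentials, database names, the data directory path, and the join table and columns (localtable, field1, field2) are all hypothetical:
#!/bin/bash
set -euo pipefail

# step 2: export from the source instance (INTO OUTFILE writes on the source server's filesystem)
mysql -h source-host -u user -p somedb -e \
  "SELECT * FROM anothertable INTO OUTFILE '/tmp/temp.csv' FIELDS TERMINATED BY ','"

# step 3: copy the file into the target instance's data directory as yourcsvtable.CSV
scp source-host:/tmp/temp.csv target-host:/var/lib/mysql/targetdb/yourcsvtable.CSV

# steps 4-5: reload the CSV table and run the left join
mysql -h target-host -u user -p targetdb -e \
  "FLUSH TABLE yourcsvtable;
   SELECT t.*, c.field2
   FROM localtable t
   LEFT JOIN yourcsvtable c ON c.field1 = t.field1;"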
If you export it as CSV, it's no longer SQL, just plain row data. I suggest you export it as SQL instead and import it into the second database.
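A minimal sketch of that route with mysqldump; hosts, credentials, database names, and the join table and column (localtable, field1) are hypothetical:
mysqldump -h source-host -u user -p somedb anothertable > anothertable.sql
mysql -h target-host -u user -p targetdb < anothertable.sql
mysql -h target-host -u user -p targetdb -e \
  "SELECT t.* FROM localtable t LEFT JOIN anothertable a ON a.field1 = t.field1;"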
I'm a newbie here trying to import some data into my WordPress database (MySQL), and I wonder if any of you SQL experts out there can help?
Database type: MySQL
Table name: wp_loans
I would like to completely replace the data in table wp_loans with the contents of file xyz.csv located on a remote server, for example https://www.mystagingserver.com/xyz.csv
All existing data in the table should be replaced with the contents of the CSV file.
The 1st row of the CSV file contains the table headings, so it can be ignored.
I'd also like to automate the script to run daily at say 01:00 in the morning if possible.
UPDATE
Here is the SQL I'm using to try and replace the table contents:
LOAD DATA INFILE 'https://www.mystagingserver.com/xyz.csv'
REPLACE
INTO TABLE wp_loans
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
IGNORE 1 LINES
I would recommend a cron job to automate the process, and normally I would use BCP (bulk copy) to insert the data into the table... But seeing as you are using MySQL, instead of BCP, try LOAD DATA INFILE - https://mariadb.com/kb/en/load-data-infile/
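A sketch of that combination; credentials, the database name, and the script/log paths are hypothetical. LOAD DATA INFILE cannot read a URL directly, so the file is downloaded first and loaded with LOCAL; TRUNCATE replaces the existing rows and IGNORE 1 LINES skips the header row, matching the requirements above:
#!/bin/bash
# refresh_wp_loans.sh -- download the CSV and reload the wp_loans table
set -euo pipefail
curl -fsSL 'https://www.mystagingserver.com/xyz.csv' -o /tmp/xyz.csv
mysql --local-infile=1 -u wp_user -p'secret' wordpress_db -e \
  "TRUNCATE TABLE wp_loans;
   LOAD DATA LOCAL INFILE '/tmp/xyz.csv'
   INTO TABLE wp_loans
   FIELDS TERMINATED BY ',' ENCLOSED BY '\"'
   IGNORE 1 LINES;"   # hypothetical credentials and database name

# crontab entry for a daily run at 01:00
# 0 1 * * * /path/to/refresh_wp_loans.sh >> /tmp/refresh_wp_loans.log 2>&1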