I need some help understanding how Hive references data. The situation: I have a CSV file, data.csv, imported into Hadoop. I have found many snippets that use an external table to create a schema on top of the CSV file. My question is: how does Hive know that the schema of the external table is connected to data.csv? In the examples I cannot find any reference to the CSV file.
Where is sample_1.csv referenced in such a Hive example, or how does Hive know that the table's data comes from sample_1.csv?
While creating an external table we have to give the list of columns and the HDFS location. Hive stores only the column metadata (column name, datatype, etc.) and the HDFS location.
When we execute a query on the external table, Hive fetches the metadata and then reads whatever files are available at that HDFS location.
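A minimal sketch of such a table (the path, table name, and column definitions are assumptions for illustration); the only link to data.csv is the LOCATION clause, which points at the directory the file sits in:

CREATE EXTERNAL TABLE sample_data (
  id   INT,
  name STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/hadoop/data';   -- the directory that contains data.csv

Hive never records the file name itself; at query time it simply lists and reads every file under that directory.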
Now we've got the answer. The manual recommends storing one file per directory. When we then build an external table on top of it, the data in that directory is picked up and interpreted using the table's schema.
In my test case I imported 3 CSV files with one schema: 2 files matched the schema, and the third file had one extra column. If I run a query, the data of all three files is shown, but the additional column from the third file is missing.
Everything is fine now - thank you!
Related
I am aiming to have a ClickHouse table that is constantly updated and easy to export. I am considering having a ClickHouse table reference a path to a CSV file, similar to how dictionaries can reference an absolute path to a file in their source.
Is there a way to have the table update according to a CSV file, instead of having to update rows all the time?
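As a hedged illustration of the kind of file-backed reference described above (the file name and column types are assumptions), ClickHouse's file() table function re-reads a file under the server's user_files directory on every query, so the result always reflects the current CSV contents:

-- data.csv must live under the server's user_files_path
SELECT *
FROM file('data.csv', 'CSV', 'id UInt32, name String, value Float64');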
According to the documentation, when we create an EXTERNAL table in Hive and then DROP the table, the metadata is updated, but the data that was loaded into the HDFS directory /user/hive/warehouse/<database>/<table> still exists?
I have two questions:
1. How do you clean up the files in /user/hive/warehouse/<database>/<table>?
2. When I tried to create the table again with files of the same name but different data, the Hive warehouse files did not get updated.
Should they be? (I ask since I am not sure whether this is a set-up issue or expected behavior.)
Hive doesn't store (manage) any data files for EXTERNAL tables in the warehouse directory. It only stores the metadata for these tables in the Metastore.
This is the main difference between Hive internal (managed) and external tables: an internal table owns the data, an external table only knows about it.
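A minimal sketch of that difference (the table names and the path are assumptions):

-- Managed table: Hive owns the files under its warehouse directory
CREATE TABLE managed_logs (id INT, msg STRING);
-- External table: Hive only remembers where the files live
CREATE EXTERNAL TABLE external_logs (id INT, msg STRING)
LOCATION '/data/logs';

DROP TABLE managed_logs;     -- removes the metadata AND the files in the warehouse
DROP TABLE external_logs;    -- removes the metadata only; /data/logs is untouched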
To delete EXTERNAL table data, you need to delete it manually from the HDFS location; Hive only deletes the metadata in this case.
To delete HDFS files, you can simply use the rm command:
hadoop fs -rm /location_of_data
and use -rm -R if you want to delete recursively.
I have a scenario where my source can be on different versions of our database; as a result, the source file can have a different number of columns, while my destination has a defined number of columns.
Now, what we are trying to do is:
load data from the source to flat files, move them to a central server, and
then load that data into the central database. But if any column is
missing in a flat file, I need to add a derived column.
What is the best way to do this? How can I dynamically add derived columns?
You can either do this with BiMLScript as others have suggested in the comments, or you can write a script task that reads the file, analyzes the contents, and imports it. Yet another option would be to bulk import the file as-is into a staging table (which would have to be dropped and re-created every time) and write a stored procedure that analyzes the DDL and contents and imports the data into the destination table.
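A minimal sketch of the staging-table variant, assuming SQL Server as the central database; the table and column names (staging_orders, central_orders, region) are hypothetical:

-- The flat file has already been bulk-loaded as-is into dbo.staging_orders.
-- Build the INSERT dynamically so a column missing from this file version
-- is replaced by a derived constant instead of breaking the statement.
DECLARE @region_expr NVARCHAR(100) =
    CASE WHEN COL_LENGTH('dbo.staging_orders', 'region') IS NOT NULL
         THEN N'region'            -- column present in this version of the source
         ELSE N'''UNKNOWN'''       -- derived value when the column is absent
    END;

DECLARE @sql NVARCHAR(MAX) = N'
    INSERT INTO dbo.central_orders (order_id, amount, region)
    SELECT order_id, amount, ' + @region_expr + N'
    FROM dbo.staging_orders;';

EXEC sp_executesql @sql;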
I did some research on how to import XML data into MySQL, possibly with Workbench.
However, I was unable to find an easy tutorial on how to do that. I have 6 XML files, and all contain data, no schema.
From what I understood, the process consists of 2 parts:
1. Making the table (this is the part that is unclear to me) - is there a way to make the table from only the XML data file? (See the table sketch after the query below.)
2. Importing the data into the MySQL table. I think I understand this one; it could be done by executing this query:
LOAD XML LOCAL INFILE '/pathtofile/file.xml'
INTO TABLE my_tablename(personal_number, firstname, ...);
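For part 1, LOAD XML does not create the table for you; a minimal hand-written definition that the statement above could target might look like this (the column types are assumptions, and the remaining columns follow the same pattern):

CREATE TABLE my_tablename (
  personal_number VARCHAR(20),    -- assumed type; match it to the actual XML values
  firstname       VARCHAR(100)
  -- further columns, one per XML element, go here
);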
I've done it before, reading the XML files into a MySQL database with the data type set to BLOB.
I am trying to import a large HDFS file into a MySQL database. The data in the file is delimited by '^A'. How do I tell MySQL to separate each column by Ctrl-A? Also, is it possible for me to specify which fields I want to import?
See the documentation here:
http://dev.mysql.com/doc/refman/5.5/en/mysqlimport.html
You are looking for the --fields-terminated-by=string option. There is no option to select only certain fields for import, though you can use --columns=column_list to map the columns in your data to fields in the table.
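A hedged sketch of such an invocation (the database, table, and column names are assumptions; mysqlimport derives the table name from the file's basename, and the bash $'\001' escape passes the literal Ctrl-A byte as the delimiter):

# loads /tmp/events into table `events` in database mydb
mysqlimport --local \
  --fields-terminated-by=$'\001' \
  --columns=user_id,event_time,payload \
  mydb /tmp/events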