I use requests to pull JSON data for companies. How do I add a ticker column and the JSON string to a CSV file (separated by a comma) so I can import the CSV file into PostgreSQL?
My Python code:
import requests
import json

ticker_list = ['AAPL', 'MSFT', 'IBM', 'APD']
for ticker in ticker_list:
    url_profile = fmp_url + profile + ticker + '?apikey=' + apikey
    # get the data as a json array
    json_array = requests.get(url_profile).json()
    # for each record within the json array, use json.dumps to turn it into a JSON string
    json_str = [json.dumps(element) for element in json_array]
    # add a ticker column and write both ticker and json string to a csv file:
    with open("C:\\DATA\\fmp_profile_no_brackets.csv", "a") as dest:
        for element in json_str:
            dest.writelines(ticker_str + ',' + element + '\n')
In Postgres I have a table t_profile_json with 2 columns:
ticker varchar(20) and profile jsonb
When I copy the file into Postgres using:
copy fmp.t_profile_json(ticker,profile) from 'C:\DATA\fmp_profile.csv' delimiter ',';
I have this error:
ERROR: extra data after last expected column
CONTEXT: COPY t_profile_json, line 1: "AAPL,{"symbol": "AAPL", "price": 144.49, "beta": 1.219468, "volAvg": 88657734, "mktCap": 22985613828..."
SQL state: 22P04
The COPY command seems to read "AAPL, json string..." as one string. I did something wrong at dest.writelines(ticker_str + ',' + element + '\n'), but I don't know how to correct it.
Thank you so much in advance for helping!
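(For reference, the root cause is that the JSON strings themselves contain commas, so COPY with a bare delimiter ',' sees extra columns after the first comma inside the JSON. A minimal sketch of one fix, reusing the question's variable names, where fmp_url, profile and apikey are assumed to be defined as in the original script: let Python's csv module quote the JSON field, then load with COPY's CSV mode so the quoting is honored.)
import csv
import json
import requests

# Sketch only: fmp_url, profile and apikey come from the original script.
ticker_list = ['AAPL', 'MSFT', 'IBM', 'APD']
with open("C:\\DATA\\fmp_profile.csv", "w", newline="") as dest:
    writer = csv.writer(dest)  # quotes any field containing commas or quotes
    for ticker in ticker_list:
        url_profile = fmp_url + profile + ticker + '?apikey=' + apikey
        json_array = requests.get(url_profile).json()
        for element in json_array:
            # the embedded commas/quotes in the JSON are now safely quoted
            writer.writerow([ticker, json.dumps(element)])
On the Postgres side, copy fmp.t_profile_json(ticker, profile) from 'C:\DATA\fmp_profile.csv' with (format csv); should then parse each line into exactly two columns.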
I'm trying to import a large GeoJSON file into a Postgres table. In order to do so, I first convert the JSON into CSV with Python:
import pandas as pd
df = pd.read_json('myjson.txt')
df.to_csv('myjson.csv',sep='\t')
The resulting CSV looks like:
name type features
0 geobase FeatureCollection {'type': 'Feature', 'geometry': {'type': 'LineString', 'coordinates': [[-73.7408048408216, 45.5189595588608], [-73.7408749973688, 45.5189893490944], [-73.7409267622838, 45.5190212771795], [-73.7418867072278, 45.519640108602204], [-73.7419636417947, 45.5196917400376]]}, 'properties': {'ID_TRC': 1010001, 'DEB_GCH': 12320, 'FIN_GCH': 12340}}
The first three lines of the JSON file were:
{"name":"geobase","type":"FeatureCollection"
,"features":[
{"type":"Feature","geometry":{"type":"LineString","coordinates":[[-73.7408048408216,45.5189595588608],[-73.7408749973688,45.5189893490944],[-73.7409267622838,45.5190212771795],[-73.7418867072278,45.5196401086022],[-73.7419636417947,45.5196917400376]]},"properties":{"ID_TRC":1010001,"DEB_GCH":12320,"FIN_GCH":12340}}
Following that, the copy command into my postgres table is:
psql -h (host) -U (user) -d (database) -c "\COPY geometries.geobase_tmp(id,name,type,features) FROM '.../myjson.csv' with (format csv,header true, delimiter E'\t');"
results in my table filled with name,type and features. First feature (a text field) is for example the following string:
{'type': 'Feature', 'geometry': {'type': 'LineString', 'coordinates': [[-73.7408048408216, 45.5189595588608], [-73.7408749973688, 45.5189893490944], [-73.7409267622838, 45.5190212771795], [-73.7418867072278, 45.519640108602204], [-73.7419636417947, 45.5196917400376]]}, 'properties': {'ID_TRC': 1010001, 'DEB_GCH': 12320, 'FIN_GCH': 12340}}
In Postgres, when I try to read from this tmp table into another one:
SELECT features::json AS fc FROM geometries.geobase_tmp
I get the error:
SQL Error [22P02]: ERROR: invalid input syntax for type json
Detail : Token "'" is invalid.
Where : JSON data, line 1: {'...
It's as if Postgres expects double quotes rather than single quotes when parsing the JSON text. What can I do to avoid this problem?
EDIT: I followed the procedure described here (datatofish.com/json-string-to-csv-python) to convert JSON to CSV. The source (the JSON txt file) is valid JSON and contains only double quotes. After conversion, it's no longer valid JSON (it contains single quotes instead of double quotes). Is there a way to output a CSV while keeping the double quotes?
I figured it out:
JSON to CSV:
import pandas as pd
import json
import csv

df = pd.read_json('myjson.txt')
# json.dumps serializes each dict with double quotes, i.e. as valid JSON text
df['geom'] = df['features'].apply(lambda x: json.dumps(x['geometry']))
df['properties'] = df['features'].apply(lambda x: json.dumps(x['properties']))
# QUOTE_ALL keeps the double quotes intact in the output (doubled per CSV rules)
df[['geom', 'properties']].to_csv('myjson.csv', sep='\t', quoting=csv.QUOTE_ALL)
Now the CSV file looks like:
"" "geom" "properties"
"0" "{""type"": ""LineString"", ""coordinates"": [[-73.7408048408216, 45.5189595588608], [-73.7408749973688, 45.5189893490944], [-73.7409267622838, 45.5190212771795], [-73.7418867072278, 45.519640108602204], [-73.7419636417947, 45.5196917400376]]}" "{""ID_TRC"": 1010001, ""DEB_GCH"": 12320, ""FIN_GCH"": 12340}"
...
Postgres tmp table created with:
CREATE TABLE geometries.geobase_tmp (
    id int,
    geom TEXT,
    properties TEXT
)
Copy CSV content into tmp table:
psql -h (host) -U (user) -d (database) -c "\COPY geometries.geobase_tmp(id,geom,properties) FROM 'myjson.csv' with (format csv,header true, delimiter E'\t');"
Creation of the final Postgres table, which contains the geometry and the properties (each property in its own field):
DROP TABLE IF EXISTS geometries.troncons;
SELECT
    row_number() OVER () AS gid,
    ST_GeomFromGeoJSON(geom) AS geom,
    properties::json->'ID_TRC' AS ID_TRC,
    properties::json->'DEB_GCH' AS DEB_GCH,
    properties::json->'FIN_GCH' AS FIN_GCH
INTO TABLE geometries.troncons
FROM geometries.geobase_tmp;
Complete Julia newbie here.
I'd like to do some processing on a CSV. Something along the lines of:
using CSV
in_file = CSV.Source("/dir/in.csv")
out_file = CSV.Sink("/dir/out.csv")
for line in CSV.eachline(in_file)
    replace!(line, "None", "")
    CSV.writeline(out_file, line)
end
This is pseudocode; those aren't existing functions.
Idiomatically, should I iterate over 1:CSV.countlines(in_file)? Do a while loop and check something?
If all you want to do is replace a string in each line, you do not need any CSV parsing utilities. Just read the file line by line, replace, and write:
infile = "/path/to/input.csv"
outfile = "/path/to/output.csv"

out = open(outfile, "w+")
for line in readlines(infile)
    newline = replace(line, "a", "b")
    write(out, newline)
end
close(out)
This will replicate the pseudocode you have in your question.
If you need to parse and read the CSV field by field, use the readcsv function in Base:
data = readcsv(infile)
typeof(data)  # Array{Any,2}
This will return the data in the file as a two-dimensional array. You can process this data any way you want, and write it back using the writecsv function.
for i in 1:size(data, 1)  # iterate over rows
    data[i, 1] = "This is " * data[i, 1]  # add text to the first column
end
writecsv(outfile, data)
Documentation for these functions:
http://docs.julialang.org/en/release-0.5/stdlib/io-network/?highlight=readcsv#Base.readcsv
http://docs.julialang.org/en/release-0.5/stdlib/io-network/?highlight=readcsv#Base.writecsv
Since by default the serde quotes fields with ", how can I make the serde not quote my fields?
I tried:
row format serde "org.apache.hadoop.hive.serde2.OpenCSVSerde"
with serdeproperties (
    "separatorChar" = ",",
    "quoteChar" = "")
But I'm getting
FAILED: SemanticException java.lang.StringIndexOutOfBoundsException: String index out of range: 0
You could achieve this by specifying \u0000 as the quote character. Since quoteChar expects a string, you should use this Unicode version of NULL:
ROW FORMAT SERDE
    "org.apache.hadoop.hive.serde2.OpenCSVSerde"
WITH SERDEPROPERTIES (
    "separatorChar" = ",",
    "quoteChar" = "\u0000")
This Unicode NULL \u0000 is what the CSV writer class uses as the value of NO_QUOTE_CHARACTER: http://www.java2s.com/Code/Java/Development-Class/AverysimpleCSVwriterreleasedunderacommercialfriendlylicense.htm
For some reason "quoteChar" = "\u0000" didn't work for me as suggested in Nirmal's answer above.
When saving to file without quotes around the fields, I use:
-- saving to file
INSERT OVERWRITE LOCAL DIRECTORY 'file:/home/sidazhou/temp'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT *
FROM temp_table
;
PS. I know this isn't what's being asked, which concerns ROW FORMAT SERDE instead of ROW FORMAT DELIMITED FIELDS.
I am new to Apache Spark and am trying out a few POCs. I am trying to read JSON logs which are structured, but a few fields are not always guaranteed; for example:
{
    "item": "A",
    "customerId": 123,
    "hasCustomerId": true,
    ...
},
{
    "item": "B",
    "hasCustomerId": false,
    ...
}
Assume I want to transform these JSON logs into CSV. I was trying out Spark SQL to get hold of all the fields with simple SELECT statements, but as the second JSON is missing a field (although it does have an identifier), I am not sure how I can handle this.
I want to transform the above json logs to
item, customerId, ....
A , 123 , ....
B , null/0 , ....
You should use SQLContext to read the JSON file: sqlContext.read.json("file/path"). But if you want to convert it into CSV first and then read the CSV with missing values, your CSV file should look like:
item,customerId,hasCustomerId, ....
A,123,, ....   // hasCustomerId is null
B,,888, ....   // customerId is null
i.e. an empty field for each missing value. Then you have to read it like:
val df = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true") // Use first line of all files as header
.option("inferSchema", "true") // Automatically infer data types
.load("file/path")
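As a side note, if you read the JSON logs directly (skipping the CSV round-trip), Spark infers a schema that is the union of all fields across records, and any missing field simply comes back as null. A rough, untested PySpark sketch of that path, assuming the same Spark 1.x-era SQLContext and spark-csv package used above, and one JSON object per line as Spark's reader expects:
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="json-logs-to-csv")
sqlContext = SQLContext(sc)

# The inferred schema is the union of all fields, so records without
# customerId get null in that column (row B in the question).
df = sqlContext.read.json("file/path")
df.select("item", "customerId", "hasCustomerId").show()

# Writing back out via spark-csv; nulls become empty fields.
df.write.format("com.databricks.spark.csv") \
    .option("header", "true") \
    .save("output/path")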
I have a dataset in S3:
123, "some random, text", "", "", 236
I built an external table on this dataset:
CREATE EXTERNAL TABLE db1.myData(
    field1 bigint,
    field2 string,
    field3 string,
    field4 string,
    field5 bigint)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
ESCAPED BY '\\'
LOCATION 's3n://thisMyData/';
Problem/Issue:
When I do
select * from db1.myData
field2 is shown as
some random
I need the field to be
some random, text
Gotchas:
1. I cannot change the delimiter, as there are over ~300 .csv files at this location.
2. ESCAPED BY is not escaping the '\\'.
3. I'm using Hive 0.13, so I cannot use the CSV SerDe, and I'm not allowed to import new jars to the cluster (it's a complicated process to add a new jar, as I have to go through Director-level approvals).
Question:
Is there a workaround for making 'ESCAPED BY' come alive?
Any other workarounds for this?
All suggestions are welcome!
N.B.: This is not a repeat question. If you think it is, please guide me to the right page and I will take this off the portal :)
I had to use ESCAPED BY '\134', which translates to ESCAPED BY '\'.
Additionally, because I was calling the Athena CREATE TABLE statement by passing it in from a JSON file, I had to add an extra \ to escape the original \ in the JSON. So my final statement within the JSON file looked like this: ESCAPED BY '\\134'.
If you are using Hive 0.14, you can use CSV Serde like this:
CREATE TABLE my_table(a string, b string, ...)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = "\t",
"quoteChar" = "'",
"escapeChar" = "\\"
)
STORED AS TEXTFILE;
Refer to the link below for details:
https://cwiki.apache.org/confluence/display/Hive/CSV+Serde