unable to load csv from GCS bucket to BigQuery table accurately - csv

I am trying to load the airbnb_nyc dataset from a GCS bucket into a BigQuery table. Link to the dataset.
I am using the following code:
def parse_file(element):
    for line in csv.reader([element], delimiter=','):
        return line
class DataIngestion2:
    def parse_method2(self, values):
        row1 = dict(
            zip(('id', 'name', 'host_id', 'host_name', 'neighbourhood_group', 'neighbourhood', 'latitude', 'longitude',
                 'room_type', 'price', 'minimum_nights', 'number_of_reviews', 'last_review', 'reviews_per_month',
                 'calculated_host_listings_count', 'availability_365'),
                values))
        return row1
with beam.Pipeline(options=pipeline_options) as p:
    lines = p | 'Read' >> ReadFromText(known_args.input, skip_header_lines=1)\
              | 'parse' >> beam.Map(parse_file)
    pipeline2 = lines | 'Format to Dict _ original CSV' >> beam.Map(lambda x: data_ingestion2.parse_method2(x))
    pipeline2 | 'Load2' >> beam.io.WriteToBigQuery(table_spec, schema=table_schema,
                                                   write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
                                                   create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED
                                                   )
But the output in the BigQuery table is wrong.
I only get values for the first two columns; the remaining 14 columns are all NULL. I cannot figure out what I am doing wrong. Can someone help me find the error in my logic? I basically want to know how to transfer a CSV from a GCS bucket to BigQuery through a Dataflow pipeline.
Thank you,

You can use the ReadFromText method and then create your own transform by extending beam.DoFn. The code is attached below for reference.
https://beam.apache.org/releases/pydoc/2.32.0/apache_beam.io.textio.html#apache_beam.io.textio.ReadFromText
Note that you can use gs:// for GCS in file_pattern.
More details about ParDo and DoFn:
https://beam.apache.org/documentation/programming-guide/#pardo
import apache_beam as beam
from apache_beam.io.textio import ReadAllFromText, ReadFromText
from apache_beam.io.gcp.bigquery import WriteToBigQuery
from apache_beam.io.gcp.gcsio import GcsIO
import csv
COLUMN_NAMES = ['id','name','host_id','host_name','neighbourhood_group','neighbourhood','latitude','longitude','room_type','price','minimum_nights','number_of_reviews','last_review','reviews_per_month','calculated_host_listings_count','availability_365']
def files(path='gs://some/path'):
    # List every object under the prefix; list_prefix returns a dict keyed by object path.
    return list(GcsIO(storage_client='<ur storage client>').list_prefix(path=path).keys())
def transform_csv(element):
    # Parse the whole file with csv.reader so quoted fields containing \r\n stay in one row.
    # Note: the built-in open() only reads local paths; a gs:// path needs a GCS-aware opener.
    rows = []
    with open(element, newline='\r\n') as f:
        itr = csv.reader(f, delimiter=',', quotechar='"')
        skip_head = next(itr)  # skip the header row
        for row in itr:
            rows.append(row)
    return rows
def to_dict(element):
    # Turn each parsed row into a dict keyed by column name.
    rows = []
    for item in element:
        row_dict = {}
        zipped = zip(COLUMN_NAMES, item)
        for key, val in zipped:
            row_dict[key] = val
        rows.append(row_dict)
    # Return the list so FlatMap emits one dict per row
    # (a bare "yield rows" would emit the whole list as a single element).
    return rows
with beam.Pipeline() as p:
    read = (
        p
        | 'read-file' >> beam.Create(files())
        | 'transform-dict' >> beam.Map(transform_csv)
        | 'list-to-dict' >> beam.FlatMap(to_dict)
        | 'print' >> beam.Map(print)
        #| 'write-to-bq' >> WriteToBigQuery(schema=COLUMN_NAMES,table='ur table',project='',dataset='')
    )
EDITED 1: ReadFromText supports \r\n as the newline character, but it fails when the column data itself contains \r\n. The code has been updated to handle this.
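To illustrate the embedded-newline problem with the standard library only, here is a minimal sketch. The two-row sample is made up for the illustration (it is not taken from the airbnb_nyc file); it compares parsing each physical line on its own, which is what happens downstream of a line-based reader, with handing the whole stream to csv.reader.

```python
import csv
import io

# Hypothetical sample: the "name" field of the first record contains a line break.
raw = 'id,name\r\n1,"Cozy room\r\nnear park"\r\n2,Loft\r\n'

# Line-by-line parsing: the quoted field is cut in two, so one record becomes two rows.
per_line = [next(csv.reader([line])) for line in raw.splitlines()[1:]]

# Whole-stream parsing: csv.reader keeps reading until the closing quote.
whole = list(csv.reader(io.StringIO(raw, newline='')))[1:]

print(len(per_line))  # 3 rows for 2 records
print(len(whole))     # 2 rows for 2 records
print(whole[0])
```

A row that is cut short this way has fewer values than there are column names, and zip() then silently drops the unmatched columns, which is one way to end up with NULLs in a table.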
EDITED 2: GcsIO error fixed.
Note: I have used GcsIO to get the list of files.
Details here
Please Up-vote and mark as answer if this helps.

Let me suggest another approach for this use case. BigQuery offers a dedicated feature for loading data from Google Cloud Storage (GCS) into BigQuery. You can load data in several formats, and CSV is among them.
There is a nice tutorial in the Google documentation explaining how to do it. You do not have to use Dataflow or apache_beam; the process is available through the BigQuery API itself.
This works in many languages, but you do not need to write any code at all: it can be done from the console or via the Cloud SDK using the bq command. Everything can be found in the mentioned tutorial.
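For the API route, a minimal Python sketch could look like the following. It assumes the google-cloud-bigquery client library is installed and credentials are configured; the function name load_csv_from_gcs and both arguments are hypothetical, and the options shown (header skipping, schema autodetection, quoted newlines) are one plausible configuration rather than the tutorial's exact settings.

```python
def load_csv_from_gcs(uri, table_id):
    """Start a BigQuery load job that reads a CSV straight from GCS and wait for it."""
    # Imported here so that defining the function does not require the client library.
    from google.cloud import bigquery

    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,         # skip the header row
        autodetect=True,             # let BigQuery infer the schema
        allow_quoted_newlines=True,  # tolerate line breaks inside quoted fields
    )
    load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
    load_job.result()  # block until the load finishes; raises on failure
    return client.get_table(table_id).num_rows

# Example call (placeholder names, not real resources):
# load_csv_from_gcs("gs://<bucket>/<file>.csv", "<project>.<dataset>.<table>")
```

No worker pipeline is involved here: BigQuery reads the object from GCS itself, so the script only submits the job and waits.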

Related

issue with connecting data in databricks from data lake and reading JSON into Folium

I'm working on something based on this blog post:
https://python-visualization.github.io/folium/quickstart.html#Getting-Started
specifically part 13, using Choropleth maps.
The piece of code they use is the following:
import folium
import pandas as pd
url = (
"https://raw.githubusercontent.com/python-visualization/folium/master/examples/data"
)
state_geo = f"{url}/us-states.json"
state_unemployment = f"{url}/US_Unemployment_Oct2012.csv"
state_data = pd.read_csv(state_unemployment)
m = folium.Map(location=[48, -102], zoom_start=3)
folium.Choropleth(
geo_data=state_geo,
name="choropleth",
data=state_data,
columns=["State", "Unemployment"],
key_on="feature.id",
fill_color="YlGn",
fill_opacity=0.7,
line_opacity=0.2,
legend_name="Unemployment Rate (%)",
).add_to(m)
folium.LayerControl().add_to(m)
m
If I use this, I get the requested map.
Now I am trying to do this with my own data; I work in Databricks.
So I have a JSON file with the GeoJSON data (source_file1) and a CSV file (source_file2) with the data that needs to be "plotted" on the map.
source_file1 = "dbfs:/mnt/sandbox/MAARTEN/TOPO/Belgie_GEOJSON.JSON"
state_geo = spark.read.json(source_file1,multiLine=True)
source_file2 = "dbfs:/mnt/sandbox/MAARTEN/TOPO/DATASVZ.csv"
df_2 = spark.read.format("CSV").option("inferSchema", "true").option("header", "true").option("delimiter",";").load(source_file2)
state_data = df_2.toPandas()
When I adjust the code as below:
m = folium.Map(location=[48, -102], zoom_start=3)
folium.Choropleth(
geo_data=state_geo,
name="choropleth",
data=state_data,
columns=["State", "Unemployment"],
key_on="feature.properties.name_nl",
fill_color="YlGn",
fill_opacity=0.7,
line_opacity=0.2,
legend_name="% Marktaandeel CC",
).add_to(m)
folium.LayerControl().add_to(m)
m
that is, when I pass a Spark DataFrame as the geo_data parameter, I get the following error:
ValueError: Cannot render objects with any missing geometries: DataFrame[features: array<struct<geometry:struct<coordinates:array<array<array<string>>>,type:string>,properties:struct<arr_fr:string,arr_nis:bigint,arr_nl:string,fill:string,fill-opacity:double,name_fr:string,name_nl:string,nis:bigint,population:bigint,prov_fr:string,prov_nis:bigint,prov_nl:string,reg_fr:string,reg_nis:string,reg_nl:string,stroke:string,stroke-opacity:bigint,stroke-width:bigint>,type:string>>, type: string]
I think it is because something goes wrong with the format when the data is converted from the "blob format" in the Azure data lake to a Spark DataFrame. I tested this in a Jupyter notebook on my desktop, with the data going straight from the file to folium, and it all works.
If I load it directly from the source, as the example does with its web page, I adjust the geo_data parameter of the folium function:
m = folium.Map(location=[48, -102], zoom_start=3)
folium.Choropleth(
geo_data=source_file1, #this gets adjusted directly to data lake
name="choropleth",
data=state_data,
columns=["State", "Unemployment"],
key_on="feature.properties.name_nl",
fill_color="YlGn",
fill_opacity=0.7,
line_opacity=0.2,
legend_name="% Marktaandeel CC",
).add_to(m)
folium.LayerControl().add_to(m)
m
I get the error
Use "/dbfs", not "dbfs:": The function expects a local file path. The error is caused by passing a path prefixed with "dbfs:".
So I started wondering what the difference is between my JSON file and the one in the blog post. The only thing I can imagine is that the Azure data lake does not store my JSON as JSON but as a block blob file, and for some reason I am not converting it properly so that folium can read it.
So can someone with folium knowledge let me know:
A. Is it not possible to load the geo_data directly from a data lake?
B. In what format do I need to upload the data?
Any thoughts on this would be helpful!
Thanks in advance!
Solved this issue: I just had to replace "dbfs:" with "/dbfs". I had tried it many times, but with "/dbfs:", which gave another error.
Can't believe I'm this stupid :-)
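The prefix swap can be wrapped in a small helper so it is not retyped by hand. This is only a sketch: to_local_path and load_geojson are hypothetical names, and it assumes the Databricks convention described in the error message, where local file APIs see DBFS under /dbfs.

```python
import json

def to_local_path(path):
    """Map a 'dbfs:/...' URI to the '/dbfs/...' local mount path used by file APIs."""
    prefix = "dbfs:"
    if path.startswith(prefix):
        return "/dbfs" + path[len(prefix):]
    return path

def load_geojson(path):
    """Read a GeoJSON file through the local mount and return it as a dict."""
    with open(to_local_path(path), encoding="utf-8") as f:
        return json.load(f)

source_file1 = "dbfs:/mnt/sandbox/MAARTEN/TOPO/Belgie_GEOJSON.JSON"
print(to_local_path(source_file1))  # /dbfs/mnt/sandbox/MAARTEN/TOPO/Belgie_GEOJSON.JSON
```

Either the converted path or the dict returned by load_geojson should then be usable as geo_data, while Spark readers keep using the original "dbfs:" form.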

JSONDecodeError: Expecting value: line 1 column 1 (char 0) while getting data from Pokemon API

I am trying to scrape the pokemon API and create a dataset for all pokemon. So I have written a function which looks like this:
import requests
import json
import pandas as pd
def poke_scrape(x, y):
    '''
    A function that takes in a range of pokemon (based on pokedex ID) and returns
    a pandas dataframe with information related to the pokemon using the Poke API
    '''
    # GATHERING THE DATA FROM API
    url = 'https://pokeapi.co/api/v2/pokemon/'
    ids = range(x, (y+1))
    pkmn = []
    for id_ in ids:
        url = 'https://pokeapi.co/api/v2/pokemon/' + str(id_)
        pages = requests.get(url).json()
        # content = json.dumps(pages, indent = 4, sort_keys=True)
        if 'error' not in pages:
            pkmn.append([pages['id'], pages['name'], pages['abilities'], pages['stats'], pages['types']])
    # MAKING A DATAFRAME FROM GATHERED API DATA
    cols = ['id', 'name', 'abilities', 'stats', 'types']
    df = pd.DataFrame(pkmn, columns=cols)
    return df
The code works fine for most pokemon. However, when I try to run poke_scrape(229, 229) (that is, load ONLY the 229th pokemon), it raises a JSONDecodeError. It looks like this:
So far I have tried using json.loads() instead, but that has not solved the issue. What is even more perplexing is that this specific pokemon has loaded before, and the same issue occurred with another ID; otherwise I could just manually enter the stats for the pokemon that fails to load into my dataframe. Any help is appreciated!
Because of the way the PokeAPI works, some links to the JSON data for a pokemon only load when the link ends with a '/' (such as https://pokeapi.co/api/v2/pokemon/229/ vs https://pokeapi.co/api/v2/pokemon/229: the first link works and the second returns not found). However, others respond with an error because of the added '/', so I fixed the issue with a few if statements right after the start of the for loop at the beginning of the function.
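The exact if statements are not shown, so the following is only a sketch of one way to express the same idea: try the URL with the trailing slash first and fall back to the form without it. It uses urllib from the standard library instead of requests, and the helper names http_get_json and fetch_pokemon are made up for the illustration.

```python
import json
from urllib.error import HTTPError
from urllib.request import urlopen

BASE_URL = "https://pokeapi.co/api/v2/pokemon/"

def http_get_json(url):
    """Fetch a URL and decode JSON; return None on an HTTP error or a non-JSON body."""
    try:
        with urlopen(url) as resp:
            return json.loads(resp.read().decode("utf-8"))
    except (HTTPError, ValueError):  # json.JSONDecodeError is a ValueError
        return None

def fetch_pokemon(id_, get_json=http_get_json):
    """Try both URL spellings and return the first payload that decodes."""
    for url in (f"{BASE_URL}{id_}/", f"{BASE_URL}{id_}"):
        data = get_json(url)
        if data is not None:
            return data
    return None  # neither spelling worked; the caller can skip this id
```

Inside poke_scrape, the line that calls requests.get(url).json() would become pages = fetch_pokemon(id_), followed by a check for None before appending.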

Cloud Function running multiple times instead of once

I upload 10 files every day at 11 p.m. with a cron job to a bucket on GCS. Each file is a .csv with a size from 2 to 30 KB. The file name is always YYYY-MM-DD-ID.csv.
A Cloud Function is called every time I upload a file into that bucket, to send those .csv files to BigQuery. The trigger type is Cloud Storage on finalise/create events.
My issue is the following:
On BigQuery, each value in each row/column is multiplied by some factor. Sometimes it is 1 (so the value is the same), often 2, and sometimes 3. I attached one example below with the difference between BigQuery (BQ) and Google Cloud Storage (GCS).
It seems that the Cloud Function is called multiple times. The cause is not in the code but rather duplicate message deliveries to the Cloud Function from the trigger. When I go to the logs tab for today, I can see that the Cloud Function upload_to_bigquery has been called multiple times.
I have tried to fix it, but I made a mistake. I thought we could write temporary files in Cloud Functions, but we cannot. My solution was to write the name of each file I upload to BigQuery to a .txt file and, before uploading a new file to BigQuery, to read that .txt file and check whether the current file is in that list. If the filename is already present, skip it. Otherwise, add the filename to the list and do my stuff.
if file_to_upload not in text:
    text.append(file_to_upload)
    with open("all_uploaded_files.txt", "w") as text_file:
        for item in text:
            text_file.write(item + "\n")
    bucket = storage_client.bucket('sfr-test-data')
    blob = bucket.blob("all_uploaded_files.txt")
    blob.upload_from_filename("all_uploaded_files.txt")
    ## do my things here
else:
    print("file already uploaded")
    # skip to new file to upload
But even if I could do that, this solution is not viable. The file would become so large after months or years that it would be a mess. Do you know the easiest way to fix this issue?
Cloud Function: upload_to_big_query - main.py
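import os
from datetime import datetime
import pandas as pd
from google.cloud import bigquery, storage
BUCKET = "xxx"
GOOGLE_PROJECT = "xxx"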
HEADER_MAPPING = {
"Source/Medium": "source_medium",
"Campaign": "campaign",
"Last Non-Direct Click Conversions": "last_non_direct_click_conversions",
"Last Non-Direct Click Conversion Value": "last_non_direct_click_conversion_value",
"Last Click Prio Conversions": "last_click_prio_conversions",
"Last Click Prio Conversion Value": "last_click_prio_conversion_value",
"Data-Driven Conversions": "dda_conversions",
"Data-Driven Conversion Value": "dda_conversion_value",
"% Change in Conversions from Last Non-Direct Click to Last Click Prio": "last_click_prio_vs_last_click",
"% Change in Conversions from Last Non-Direct Click to Data-Driven": "dda_vs_last_click"
}
SPEND_HEADER_MAPPING = {
"Source/Medium": "source_medium",
"Campaign": "campaign",
"Spend": "spend"
}
tables_schema = {
"google-analytics": [
bigquery.SchemaField("date", bigquery.enums.SqlTypeNames.DATE, mode='REQUIRED'),
bigquery.SchemaField("week", bigquery.enums.SqlTypeNames.INT64, mode='REQUIRED'),
bigquery.SchemaField("goal", bigquery.enums.SqlTypeNames.STRING, mode='REQUIRED'),
bigquery.SchemaField("source", bigquery.enums.SqlTypeNames.STRING, mode='NULLABLE'),
bigquery.SchemaField("medium", bigquery.enums.SqlTypeNames.STRING, mode='NULLABLE'),
bigquery.SchemaField("campaign", bigquery.enums.SqlTypeNames.STRING, mode='NULLABLE'),
bigquery.SchemaField("last_non_direct_click_conversions", bigquery.enums.SqlTypeNames.INT64, mode='NULLABLE'),
bigquery.SchemaField("last_non_direct_click_conversion_value", bigquery.enums.SqlTypeNames.FLOAT64, mode='NULLABLE'),
bigquery.SchemaField("last_click_prio_conversions", bigquery.enums.SqlTypeNames.INT64, mode='NULLABLE'),
bigquery.SchemaField("last_click_prio_conversion_value", bigquery.enums.SqlTypeNames.FLOAT64, mode='NULLABLE'),
bigquery.SchemaField("dda_conversions", bigquery.enums.SqlTypeNames.FLOAT64, mode='NULLABLE'),
bigquery.SchemaField("dda_conversion_value", bigquery.enums.SqlTypeNames.FLOAT64, mode='NULLABLE'),
bigquery.SchemaField("last_click_prio_vs_last_click", bigquery.enums.SqlTypeNames.FLOAT64, mode='NULLABLE'),
bigquery.SchemaField("dda_vs_last_click", bigquery.enums.SqlTypeNames.FLOAT64, mode='NULLABLE')
],
"google-analytics-spend": [
bigquery.SchemaField("date", bigquery.enums.SqlTypeNames.DATE, mode='REQUIRED'),
bigquery.SchemaField("week", bigquery.enums.SqlTypeNames.INT64, mode='REQUIRED'),
bigquery.SchemaField("source", bigquery.enums.SqlTypeNames.STRING, mode='NULLABLE'),
bigquery.SchemaField("medium", bigquery.enums.SqlTypeNames.STRING, mode='NULLABLE'),
bigquery.SchemaField("campaign", bigquery.enums.SqlTypeNames.STRING, mode='NULLABLE'),
bigquery.SchemaField("spend", bigquery.enums.SqlTypeNames.FLOAT64, mode='NULLABLE'),
]
}
def download_from_gcs(file):
    client = storage.Client()
    bucket = client.get_bucket(BUCKET)
    blob = bucket.get_blob(file['name'])
    file_name = os.path.basename(os.path.normpath(file['name']))
    blob.download_to_filename(f"/tmp/{file_name}")
    return file_name
def load_in_bigquery(file_object, dataset: str, table: str):
    client = bigquery.Client()
    table_id = f"{GOOGLE_PROJECT}.{dataset}.{table}"
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
        schema=tables_schema[table]
    )
    job = client.load_table_from_file(file_object, table_id, job_config=job_config)
    job.result() # Wait for the job to complete.
def __order_columns(df: pd.DataFrame, spend=False) ->pd.DataFrame:
    # We want to have source and medium columns at the third position
    # for a spend data frame and at the fourth postion for others df
    # because spend data frame don't have goal column.
    pos = 2 if spend else 3
    cols = df.columns.tolist()
    cols[pos:2] = cols[-2:]
    cols = cols[:-2]
    return df[cols]
def __common_transformation(df: pd.DataFrame, date: str, goal: str) -> pd.DataFrame:
    # for any kind of dataframe, we add date and week columns
    # based on the file name and we split Source/Medium from the csv
    # into two different columns
    week_of_the_year = datetime.strptime(date, '%Y-%m-%d').isocalendar()[1]
    df.insert(0, 'date', date)
    df.insert(1, 'week', week_of_the_year)
    mapping = SPEND_HEADER_MAPPING if goal == "spend" else HEADER_MAPPING
    print(df.columns.tolist())
    df = df.rename(columns=mapping)
    print(df.columns.tolist())
    print(df)
    df["source_medium"] = df["source_medium"].str.replace(' ', '')
    df[["source", "medium"]] = df["source_medium"].str.split('/', expand=True)
    df = df.drop(["source_medium"], axis=1)
    df["week"] = df["week"].astype(int, copy=False)
    return df
def __transform_spend(df: pd.DataFrame) -> pd.DataFrame:
    df["spend"] = df["spend"].astype(float, copy=False)
    df = __order_columns(df, spend=True)
    return df[df.columns[:6]]
def __transform_attribution(df: pd.DataFrame, goal: str) -> pd.DataFrame:
    df.insert(2, 'goal', goal)
    df["last_non_direct_click_conversions"] = df["last_non_direct_click_conversions"].astype(int, copy=False)
    df["last_click_prio_conversions"] = df["last_click_prio_conversions"].astype(int, copy=False)
    df["dda_conversions"] = df["dda_conversions"].astype(float, copy=False)
    return __order_columns(df)
def transform(df: pd.DataFrame, file_name) -> pd.DataFrame:
    goal, date, *_ = file_name.split('_')
    df = __common_transformation(df, date, goal)
    # we only add goal in attribution df (google-analytics table).
    return __transform_spend(df) if "spend" in file_name else __transform_attribution(df, goal)
def main(event, context):
    """Triggered by a change to a Cloud Storage bucket.
    Args:
        event (dict): Event payload.
        context (google.cloud.functions.Context): Metadata for the event.
    """
    file = event
    file_name = download_from_gcs(file)
    df = pd.read_csv(f"/tmp/{file_name}")
    transformed_df = transform(df, file_name)
    with open(f"/tmp/bq_{file_name}", "w") as file_object:
        file_object.write(transformed_df.to_csv(index=False))
    with open(f"/tmp/bq_{file_name}", "rb") as file_object:
        table = "google-analytics-spend" if "spend" in file_name else "google-analytics"
        load_in_bigquery(file_object, dataset='attribution', table=table)
You might want to check this thread:
BigQuery displaying wrong results - Duplicating data from Cloud Function?
In short: the function has to be idempotent, and the state of the process (whether the data/file has been loaded into BQ or not) should be kept outside of the cloud function. A text file (in some GCS bucket, not inside the cloud function memory, which can be erased as soon as the cloud function execution is finished) is an option, but GCS has plenty of drawbacks in this particular case. Firestore, for example, is a much better choice.
You might consider the following algorithm:
When your cloud function starts, it should calculate a hash code based on the input data: the file/object metadata, the file/object data, or a combination of both. That hash should be unique for the given set of data.
Your cloud function connects to a predefined Firestore collection (the project and the name can be provided in environment variables) and checks whether a document/record with the given hash as its id already exists.
If that hash already exists (the document exists) in the Firestore collection, the cloud function finishes its execution and does nothing else (it can do logging, add some additional details to the Firestore document if required, etc.).
If that hash is not found (the document does not exist), the cloud function creates a new document with the given hash as its id. Some metadata details can be added to that document if needed.
Once the document is created, the cloud function continues the main 'workflow'.
A few things to bear in mind.
1/ IAM permissions: the service account under which the cloud function runs should have the relevant permissions on Firestore. Obviously, the Firestore API has to be enabled in the given project.
2/ What happens if the cloud function creates a new Firestore document but then fails to load the data into BigQuery (for any reason)? A check on the existence of the Firestore document alone may not be enough, so a proper 'state' should be maintained in the document. For example, when a new document is created in Firestore, it gets a field __state with the value IN_PROGRESS. Then, when the data is loaded, the cloud function comes back to Firestore and updates that field to DONE. But even that is not enough. As you have a load job, there may be cases where the load actually succeeds but the cloud function fails (for any reason, including a timeout). You may want to think about what to do in that case as well. In any case, having some 'state' monitoring in Firestore helps to understand and investigate the loading process. Automating the monitoring might need a separate cloud function, but that is a separate story.
3/ As I mentioned in the thread linked above (see the reasoning in that answer), loading data from inside the cloud function memory is a risky idea. I would suggest rethinking that part of your algorithm.
4/ It might be a good idea to move the loaded file/object from the "input" bucket to some "processed" (or "archive") bucket in case of success, and to a "failure" bucket in case the load failed. That is to be done in the cloud function code. A failure outcome can also be reflected in the Firestore document (i.e. set the value of the __state field to FAILURE).
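The algorithm and the __state field described above can be sketched as follows. To keep the example self-contained, a plain dict stands in for the Firestore collection; that is an assumption for illustration only, since a dict does not survive between invocations and a real function needs the external store. The function names are hypothetical, and the event keys bucket, name and generation are assumed to be present in the Cloud Storage event payload.

```python
import hashlib

def event_key(event):
    """Derive a stable id from the object's bucket, name, and generation."""
    raw = "{}/{}#{}".format(event["bucket"], event["name"], event.get("generation", ""))
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()

def handle_event(event, load_fn, store):
    """Run load_fn(event) at most once per distinct object version."""
    key = event_key(event)
    doc = store.get(key)
    if doc is not None and doc["__state"] in ("IN_PROGRESS", "DONE"):
        return "skipped"  # duplicate delivery: do nothing
    # Note: check-then-write is not atomic here; a real store needs an atomic create.
    store[key] = {"__state": "IN_PROGRESS", "name": event["name"]}
    try:
        load_fn(event)
    except Exception:
        store[key]["__state"] = "FAILURE"  # leave a trace and allow a retry
        raise
    store[key]["__state"] = "DONE"
    return "loaded"

# Simulate the same finalize event being delivered twice.
store = {}
loaded = []
event = {"bucket": "xxx", "name": "2021-01-01-1.csv", "generation": "1"}
print(handle_event(event, loaded.append, store))  # loaded
print(handle_event(event, loaded.append, store))  # skipped
```

Keying on the generation as well as the name means that re-uploading a corrected file under the same name is treated as new data, whereas a redelivered event for the same upload is skipped.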

Groovy csv to string

I am using Dell Boomi to map data from one system to another. I can use groovy in the maps but have no experience with it. I tried to do this with the other Boomi tools, but have been told that I'll need to use groovy in a script. My inbound data is:
132265,Brown
132265,Gold
132265,Gray
132265,Green
I would like to output:
132265,"Brown,Gold,Gray,Green"
Hopefully this makes sense! Any ideas on the groovy code to make this work?
It can be elegantly solved with groupBy and the spread operator:
@Grapes(
    @Grab(group='org.apache.commons', module='commons-csv', version='1.2')
)
import org.apache.commons.csv.*
def csv = '''
132265,Brown
132265,Gold
132265,Gray
132265,Green
'''
def parsed = CSVParser.parse(csv, CSVFormat.DEFAULT.withHeader('code', 'color'))
parsed.records.groupBy({ it.code }).each { k,v -> println "$k,\"${v*.color.join(',')}\"" }
The above prints:
132265,"Brown,Gold,Gray,Green"
Well, I don't know how you are getting your data, but here is a general way to achieve your goal. You can use a library, such as the one below, to parse the CSV.
https://github.com/xlson/groovycsv
The example for your data would be:
@Grab('com.xlson.groovycsv:groovycsv:1.1')
import static com.xlson.groovycsv.CsvParser.parseCsv
def csv = '''
132265,Brown
132265,Gold
132265,Gray
132265,Green
'''
def data = parseCsv(csv)
I believe you want to associate the number with various values of colors. So for each line you can create a map of the number and the colors associated with that number, splitting the line by ",":
map = [:]
for(line in data) {
    number = line.split(',')[0]
    colour = line.split(',')[1]
    if(!map[number])
        map[number] = []
    map[number].add(colour)
}
println map
So map should contain:
[132265:["Brown","Gold","Gray","Green"]]
Well, if it is not what you want, you can extract the general idea.
Assuming your data is coming in as a comma separated string of data like this:
"132265,Brown 132265,Gold 132265,Gray 132265,Green 122222,Red 122222,White"
The following Groovy script code should do the trick.
def csvString = "132265,Brown 132265,Gold 132265,Gray 132265,Green 122222,Red 122222,White"
LinkedHashMap.metaClass.multiPut << { key, value ->
delegate[key] = delegate[key] ?: []; delegate[key] += value
}
def map = [:]
def csv = csvString.split().collect{ entry -> entry.split(",") }
csv.each{ entry -> map.multiPut(entry[0], entry[1]) }
def result = map.collect{ k, v -> k + ',"' + v.join(",") + '"'}.join("\n")
println result
Would print:
132265,"Brown,Gold,Gray,Green"
122222,"Red,White"
Do you HAVE to use scripting for some reason? This can be easily accomplished with out-of-the-box Boomi functionality.
Create a map function that prepends the ID field to a string of your choice (i.e. 222_concat_fields). Then use that value to set a dynamic process prop with that value.
The value of the process prop will contain the result of concatenating the name fields. Simply adding this function to your map should take care of it. Then use the final value to populate your result.
Well, it depends on how the data is coming in.
If the data you posted in the question arrives in a single document, you can easily handle this in a map with Groovy scripting.
If the data you posted in the question arrives in multiple documents, i.e.
doc1: 132265,Brown
doc2: 132265,Gold
doc3: 132265,Gray
doc4: 132265,Green
then it cannot be handled in a map. You will need to use a Data Process step with custom scripting.
The Groovy code you are asking for depends on the input profile in which you receive the data. Please provide more information, i.e. input profile, fields, etc.

Using Python's csv.dictreader to search for specific key to then print its value

BACKGROUND:
I am having issues trying to search through some CSV files.
I've gone through the python documentation: http://docs.python.org/2/library/csv.html
about the csv.DictReader(csvfile, fieldnames=None, restkey=None, restval=None, dialect='excel', *args, **kwds) object of the csv module.
My understanding is that csv.DictReader assumes the first line/row of the file contains the fieldnames; however, my CSV dictionary file simply starts with "key","value" and goes on for at least 500,000 lines.
My program will ask the user for the title (thus the key) they are looking for, and present the value (which is the 2nd column) on the screen using the print function. My problem is how to use csv.DictReader to search for a specific key and print its value.
Sample Data:
Below is an example of the csv file and its contents...
"Mamer","285713:13"
"Champhol","461034:2"
"Station Palais","972811:0"
So if I want to find "Station Palais" (input), my output will be 972811:0. I am able to manipulate the string and create the overall program; I just need help with csv.DictReader. I appreciate any assistance.
EDITED PART:
import csv
def main():
    with open('anchor_summary2.csv', 'rb') as file_data:
        list_of_stuff = []
        reader = csv.DictReader(file_data, ("title", "value"))
        for i in reader:
            list_of_stuff.append(i)
        print list_of_stuff
main()
The documentation you linked to provides half the answer:
class csv.DictReader(csvfile, fieldnames=None, restkey=None, restval=None, dialect='excel', *args, **kwds)
[...] maps the information read into a dict whose keys are given by the optional fieldnames parameter. If the fieldnames parameter is omitted, the values in the first row of the csvfile will be used as the fieldnames.
It would seem that if the fieldnames parameter is passed, the given file will not have its first record interpreted as headers (the parameter will be used instead).
# file_data is the open file object, not the filename
reader = csv.DictReader(file_data, ("title", "value"))
for i in reader:
    list_of_stuff.append(i)
which will (apparently; I've been having trouble with it) produce the following data structure:
[{"title": "Mamer", "value": "285713:13"},
{"title": "Champhol", "value": "461034:2"},
{"title": "Station Palais", "value": "972811:0"}]
which may need to be further massaged into a title-to-value mapping by something like this:
data = {}
for i in list_of_stuff:
    data[i["title"]] = i["value"]
Now just use the keys and values of data to complete your task.
And here it is as a dictionary comprehension:
data = {row["title"]: row["value"] for row in csv.DictReader(file_data, ("title", "value"))}
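As a quick check of the fieldnames behaviour, here is a small sketch that feeds the three sample rows from the question through io.StringIO instead of a real file (the in-memory file is the only thing assumed here).

```python
import csv
import io

sample = '"Mamer","285713:13"\n"Champhol","461034:2"\n"Station Palais","972811:0"\n'

# Passing fieldnames means the first row is treated as data, not as a header.
rows = list(csv.DictReader(io.StringIO(sample), fieldnames=("title", "value")))
data = {row["title"]: row["value"] for row in rows}

print(len(rows))               # 3
print(data["Station Palais"])  # 972811:0
```

The first record ("Mamer") shows up in rows, which confirms that no line was consumed as a header.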
The currently accepted answer is fine, but there's a slightly more direct way of getting at the data. The dict() constructor in Python can take any iterable.
In addition, your code might have issues on Python 3, because Python 3's csv module expects the file to be opened in text mode, not binary mode. You can make your code compatible with 2 and 3 by using io.open instead of open.
import csv
import io
with io.open('anchor_summary2.csv', 'r', newline='', encoding='utf-8') as f:
    data = dict(csv.reader(f))
print(data['Champhol'])
As a warning, if your csv file has two rows with the same value in the first column, the later value will overwrite the earlier value. (This is also true of the other posted solution.)
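If duplicate keys are possible and every value matters, one option is to collect the values in a list per key. The sketch below uses a made-up second "Mamer" row to show the difference from the plain dict() approach.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample with a repeated key in the first column.
sample = '"Mamer","285713:13"\n"Mamer","999999:1"\n"Champhol","461034:2"\n'

# Plain dict(): the later row silently replaces the earlier one.
overwritten = dict(csv.reader(io.StringIO(sample)))

# defaultdict(list): every value for a key is kept, in file order.
values_by_title = defaultdict(list)
for title, value in csv.reader(io.StringIO(sample)):
    values_by_title[title].append(value)

print(overwritten["Mamer"])      # 999999:1
print(values_by_title["Mamer"])  # ['285713:13', '999999:1']
```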
If your program really is only supposed to print the result, there's really no reason to build a keyed dictionary.
import csv
import io
# Python 2/3 compat
try:
    input = raw_input
except NameError:
    pass
def main():
    # Case-insensitive & leading/trailing whitespace insensitive
    user_city = input('Enter a city: ').strip().lower()
    with io.open('anchor_summary2.csv', 'r', newline='', encoding='utf-8') as f:
        for city, value in csv.reader(f):
            if user_city == city.lower():
                print(value)
                break
        else:
            print("City not found.")
if __name__ == '__main__':
    main()
The advantage of this technique is that the CSV isn't loaded into memory and the data is only iterated over once. I also added a little code that calls lower on both keys to make the match case-insensitive. Another advantage is that if the city the user requests is near the top of the file, it returns almost immediately and stops looking through the file.
With all that said, if searching performance is your primary consideration, you should consider storing the data in a database.
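As one possible reading of the database suggestion, the sketch below loads the rows into SQLite from the standard library and looks keys up through the primary-key index. The table name anchors, the lookup helper and the in-memory database are all assumptions for the illustration; a real setup would load anchor_summary2.csv once into a database file on disk.

```python
import csv
import io
import sqlite3

sample = '"Mamer","285713:13"\n"Champhol","461034:2"\n"Station Palais","972811:0"\n'

conn = sqlite3.connect(":memory:")  # use a file path to keep the data between runs
conn.execute("CREATE TABLE anchors (title TEXT PRIMARY KEY, value TEXT)")
# INSERT OR REPLACE mirrors dict semantics: a later duplicate title wins.
conn.executemany("INSERT OR REPLACE INTO anchors VALUES (?, ?)",
                 csv.reader(io.StringIO(sample)))
conn.commit()

def lookup(title):
    """Return the value stored for a title, or None if it is absent."""
    row = conn.execute("SELECT value FROM anchors WHERE title = ?",
                       (title.strip(),)).fetchone()
    return row[0] if row else None

print(lookup("Station Palais"))  # 972811:0
```

The one-off load costs a full pass over the file, but each later lookup goes through the index instead of scanning 500,000 lines.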