I am trying to read an HTML table into a Pandas DataFrame. However, the read_html function finds only a single row (the column names; it identifies all the columns correctly). When I specify header=0, there are no rows left and the DataFrame is empty. Why could that be? I cannot disclose the dataset, but the table is quite complex and contains a lot of additional attributes, e.g.:
Instead of a simple <td> I have <td bgcolor='#CCCCCC', align=center>
Could this be the source of the problem? Is there a viable solution? Thanks for the help!
I'd like to split a csv column containing a dictionary-like structure/values into component columns. For example input/output data, see this spreadsheet. Data will always come in that format ({"key":value,...}), with the number of key value pairs being arbitrary.
I am not necessarily looking for a full solution here; I am more curious what my options are for parsing the data to create the output I want. I am open to using Python for some of this.
Use in B3:
=INDEX(BYROW(A3:A5, LAMBDA(x, IFNA(HLOOKUP(B2:E2,
TRANSPOSE(SPLIT(TRIM(FLATTEN(SPLIT(REGEXREPLACE(x, "[\}\{"",]", ),
CHAR(10)))), ":")), 2, 0)))))
In your case, you can use the formula
=SPLIT(REGEXREPLACE(A1,""".*"":|\{|\}|\n|\r",""),",")
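Since the question is open to Python, here is a minimal sketch of the same idea outside the spreadsheet. It assumes each cell holds valid JSON of the form {"key":value,...}; the helper name expand_json_column, the column name "raw", and the sample rows are hypothetical and only stand in for the real data.

```python
import csv
import io
import json


def expand_json_column(rows, column):
    """Expand a column of JSON-object strings into one column per key."""
    parsed = [json.loads(row[column]) for row in rows]
    # Collect keys in first-seen order so the output columns are stable.
    keys = []
    for obj in parsed:
        for key in obj:
            if key not in keys:
                keys.append(key)
    # Keys missing from a row become empty strings.
    return keys, [{k: obj.get(k, "") for k in keys} for obj in parsed]


# Hypothetical sample data standing in for the spreadsheet column.
sample = [
    {"raw": '{"a":1,"b":2}'},
    {"raw": '{"b":3,"c":4}'},
]
keys, expanded = expand_json_column(sample, "raw")

# Write the expanded rows as CSV text (a file object would work the same way).
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=keys)
writer.writeheader()
writer.writerows(expanded)
print(out.getvalue())
```

Because the keys are collected across all rows first, an arbitrary number of key/value pairs per row is handled; rows that lack a key simply get an empty cell.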
I have come across an issue when using a CSV file as input to a crawler.
The crawler doesn't identify the column headers when all the data in the CSV is in string format.
#P1 Headers are displayed as col0,col1...colN.
#P2 The actual column names are treated as data.
#P3 The metadata is wrong (i.e. the column datatype is shown as string even when the CSV dataset contains date/timestamp values).
If we use a custom (CSV) classifier, we have to specify the column headers manually.
#P2 gets covered, i.e. the column names are removed from the data, however
#P1 still remains the same: the column headers are displayed as col0,col1...colN.
There are 3 things I want to avoid in order to achieve the expected result:
A CSV with strings only should show the actual column names instead of col0,col1...colN.
The metadata of the generated table should be correct (i.e. date/timestamp, string) once it has been crawled by the crawler.
If a custom classifier is used, we need to specify the column header names manually in the classifier, yet the result is not satisfactory.
I need a generic solution instead of manual intervention.
I have gone through this document: here
If anyone has already implemented a solution, please help.
I found a solution to one of the above points. The headers, i.e. the first line of the CSV, are displayed by using 'Has heading' in the CSV classifier.
However, a solution for the following is yet to be figured out:
The metadata of the CSV file is shown as string even if a column contains timestamp/date values. The crawler reads these datatypes as string.
The custom classifier needs manual intervention. I have specified all column names in the classifier. Is there a generic solution?
If you are using the pandas DataFrame.to_csv method to write the dataframe, then to avoid getting column names such as col1, col2 and so on, add the parameter
index_label='index', such as:
df.to_csv(index_label='index')
I have pulled in data from a number of CSV files, as well as from a database. I wish to use a merge to build a dataframe isolating the phone numbers that are contained in both dataframes (one originating from the CSV files, the other from the database). However, the dataframe from the database displays as type 'NoneType'. This prevents any operation such as merge. How can I change this to allow the operation?
The data comes in from the database as a list of tuples, which I then convert to a dataframe. However, as stated above, it displays as 'NoneType'. I assume I am confused about how dataframes handle data types.
#Grab Data
mycursor = mydb.cursor()
mycursor.execute("SELECT DISTINCT(Cell) FROM crm_data.ap_clients Order By Cell asc;")
apclients = mycursor.fetchall()
#Clean Phone Number Data: keep only the last 10 characters of each number
data['phone_number'] = data['phone_number'].str[-10:]
data2['phone_number'] = data2['phone_number'].str[-10:]
data3['phone_number'] = data3['phone_number'].str[-10:]
#make data frame from csv files
fbl = pd.concat([data,data2,data3], axis=0, sort=False)
#make data frame from apclients(database extraction)
apc = pd.DataFrame(apclients)
#perform merge finding all records in both frames
#(a frame built from a list of tuples labels its column with the integer 0, not the string '0')
successfulleads = pd.merge(fbl, apc, left_on='phone_number', right_on=0)
#type(apc) returns NoneType
The expected results are to find all records in both dataframes, along with a count so that I may compare the two sets. Any help is greatly appreciated from this great community :)
So it turns out I had a call that renames the dataframe's column, as shown below:
apc = apc.rename(columns={'0': 'phone_number'}, inplace=True)
for col in apc.columns:
    print(col)
The part of the above snippet responsible for the problem:
inplace=True
This argument dictates whether the dataframe is modified in place or whether a modified copy is returned. With inplace=True, rename returns None, so assigning the result back to apc replaced the dataframe with None, hence the NoneType.
Hope this helps whoever ends up in my position. A great thanks again to the community. :)
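For anyone wanting to see the difference side by side, here is a minimal sketch. The phone numbers and the small fbl frame are made-up stand-ins for the database rows and the CSV data, and the column label is the integer 0 because that is how pandas labels a frame built from a list of tuples.

```python
import pandas as pd

# Hypothetical rows standing in for mycursor.fetchall(): a list of 1-tuples.
apclients = [("5551234567",), ("5559876543",)]
apc = pd.DataFrame(apclients)  # the single column is labelled with the integer 0

# With inplace=True, rename modifies the frame it is called on and returns None,
# so assigning that result to apc would replace the dataframe with None.
returned = apc.copy().rename(columns={0: "phone_number"}, inplace=True)

# Without inplace, rename returns a new dataframe that is safe to assign.
apc = apc.rename(columns={0: "phone_number"})

# Hypothetical CSV-side frame, to show the merge working afterwards.
fbl = pd.DataFrame({"phone_number": ["5551234567", "5550000000"]})
successfulleads = pd.merge(fbl, apc, on="phone_number")  # inner join by default
print(returned is None, list(apc.columns), len(successfulleads))
```

Either style works on its own (assign the result without inplace, or call with inplace=True and do not assign); the problem only appears when the two are combined.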
I am looking for a way to convert a rather big (3GB) JSON file to CSV. I tried using R, and this is the code that I used.
library("rjson")
data <- fromJSON(file="C:/POI data 30 Rows.json")
json_data <- as.data.frame(data)
write.csv(json_data, file='C:/POI data 30 Rows Exported.csv')
The example I am using is only a subset of about 30 rows of the total data, which I extracted using EmEditor and copied and pasted into a text file. The problem, however, is that it only converts the first row of the data.
I am not an experienced programmer and have tried everything in YouTube tutorials, from PHP to Excel, and nothing seems to work. The problem is that I have no idea what the structure of the data is, so I cannot create a predetermined data frame, and there are a number of missing values in the data.
Any advice would be greatly appreciated.
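One possible cause, and this is only an assumption since the file structure is unknown, is that the file holds one JSON object per line rather than a single JSON array, in which case a parser that reads one JSON value stops after the first row. Under that assumption, a minimal Python sketch that streams the file line by line (so the 3GB file never has to fit in memory) could look like this. The function name ndjson_to_csv, the file names and the sample records are hypothetical.

```python
import csv
import json
import os
import tempfile


def ndjson_to_csv(src_path, dst_path):
    """Convert a file with one JSON object per line into a CSV file."""
    # First pass: collect every key that appears, in first-seen order.
    fieldnames = {}
    with open(src_path, encoding="utf-8") as src:
        for line in src:
            if line.strip():
                for key in json.loads(line):
                    fieldnames.setdefault(key, None)

    # Second pass: write the rows, leaving missing values empty.
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.DictWriter(dst, fieldnames=list(fieldnames), restval="")
        writer.writeheader()
        for line in src:
            if line.strip():
                writer.writerow(json.loads(line))


# Small made-up sample to show the behaviour with missing values.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "sample.json")
    dst = os.path.join(tmp, "sample.csv")
    with open(src, "w", encoding="utf-8") as f:
        f.write('{"name": "A", "lat": 1.5}\n{"name": "B", "city": "X"}\n')
    ndjson_to_csv(src, dst)
    with open(dst, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
print(rows)
```

The two passes trade some speed for not needing to know the structure in advance: the first discovers the columns, the second writes the rows. Nested objects or lists are written as their Python text representation here, so they would need flattening in a further step.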
I want to do initial sorting on the first column of a table (using Thymeleaf), where I look up the value in a message file. That means the sort order can differ per country, which is what I want. Can I achieve that in HTML and Thymeleaf, or do I have to look up the translation first, before letting Thymeleaf iterate over my data to create the table?
<table>
  <tr th:each="object : ${objects}">
    <td th:text="#{${#strings.concat('messagekeyprefix.', object.name)}}"></td>
    <td th:text="${object.value}"></td>
  </tr>
</table>
And in the different message.properties files I have the translations
messagekeyprefix.name1 = Xyz
messagekeyprefix.name2 = Def
messagekeyprefix.name3 = Abc
Using the above code will present the rows in the order of the "objects". But I would like to, maybe with a dialect or something similar, do initial sorting on the first column, and it should be sorted on the translated names (so the order can differ between countries).
As per the conversation and the information provided: no, Thymeleaf doesn't have sorting capability at this stage, and I doubt it will in the near future.
Your best option is to have a POJO read your file and, based on the specifications in the file, sort either at the database query level (best) or at the object level (good).