Is there any method to import data from MySQL to Elasticsearch batch by batch? If yes, how can it be done?
The bulk import seems to be the problem: when I import 191000 items, only a few of them actually end up in the index.
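One common approach is to stream rows out of MySQL in chunks and feed them to Elasticsearch's bulk helper, which reports per-document failures instead of dropping them silently. Below is a minimal sketch, assuming the Python pymysql and elasticsearch client packages; the connection settings, the items table, its columns, and the index name are placeholders for illustration:
import pymysql
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

# Placeholder connection settings -- adjust to your environment.
conn = pymysql.connect(host="localhost", user="root", password="secret", database="mydb")
es = Elasticsearch("http://localhost:9200")

BATCH_SIZE = 1000

def generate_actions():
    # Stream rows from MySQL in chunks instead of materialising all 191000 at once.
    with conn.cursor(pymysql.cursors.DictCursor) as cursor:
        cursor.execute("SELECT id, name, description FROM items")
        while True:
            rows = cursor.fetchmany(BATCH_SIZE)
            if not rows:
                break
            for row in rows:
                yield {"_index": "items", "_id": row["id"], "_source": row}

# The bulk helper sends the documents in chunks and returns how many succeeded
# plus a list of per-document errors, which helps explain the missing items.
success, errors = bulk(es, generate_actions(), chunk_size=BATCH_SIZE, raise_on_error=False)
print(success, errors)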
Context:
I want to build an API using FastAPI and SQLModel.
I need to use database reflection to create the SQLModel table models based on an existing database.
I couldn't find a way to do database reflection directly in SQLModel.
So I do the database reflection in SQLAlchemy, and now I want to turn the SQLAlchemy table models into SQLModel ones to easily use them with FastAPI.
Problem:
I can't figure out how to create an SQLModel table model based on the SQLAlchemy table model that I created with database reflection, as shown in the code below.
The SQLModel docs suggest easy integration with SQLAlchemy, so I figured it should be easy...
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker, scoped_session
from sqlalchemy.ext.declarative import declarative_base
from fastapi import FastAPI
from sqlmodel import SQLModel
app = FastAPI()
# do database reflection
engine = create_engine('sqlite:///database.db')
Base = declarative_base()
Base.metadata.reflect(engine)
print("alchemy meta tables:\n ", Base.metadata.tables)
# creates SQLAlchemy model based on database table schema of table 'hero'
class HeroDBReflection(Base):
    __table__ = Base.metadata.tables['hero']
# TODO: How to create this Hero SQLModel data model based on the SQLAlchemy model 'HeroDBReflection'?
# class Hero(SQLModel, table=True):
#     metadata = Base.metadata.tables['hero'] ?

# @app.post("/heroes/", response_model=Hero)
# def create_hero(hero: Hero):
#     session.add(hero)
#     session.commit()
#     session.refresh(hero)
#     return hero
This is my first Stack Overflow post, so I hope that my question is clear :)
Thanks a lot!
Since Modin does not support loading from multiple parquet files on S3, I am using pyarrow to load the data.
import s3fs
import modin.pandas as pd
from pyarrow import parquet
s3 = s3fs.S3FileSystem(
    key=aws_key,
    secret=aws_secret,
)
table = parquet.ParquetDataset(
    path_or_paths="s3://bucket/path",
    filesystem=s3,
).read(
    columns=["hotelId", "startDate", "endDate"]
)
# to get a pandas df the next step would be table.to_pandas()
I now want to put the data in a Modin df for parallel computations, without having to write to and read from a CSV. Is there a way to construct the Modin df directly from a pyarrow.Table, or at least from a pandas dataframe?
Mahesh's answer should work, but I believe it would result in a full data copy (2x memory footprint by default: https://arrow.apache.org/docs/python/pandas.html#memory-usage-and-zero-copy).
At the time of writing, Modin does have a native Arrow integration, so you can convert directly using
from modin.pandas.utils import from_arrow
mdf = from_arrow(pyarrow_table)
You can't construct the Modin dataframe directly out of a pyarrow.Table, because pandas doesn't support that, and Modin only supports a subset of the pandas API. However, the table has a method that converts it to a pandas dataframe, and you can construct the Modin dataframe out of that. Using table from your code:
import modin.pandas as pd
modin_dataframe = pd.DataFrame(table.to_pandas())
dataset = ds.dataset("abfs://test", format="parquet", partitioning="hive", filesystem=fs)
I can read datasets with the pyarrow dataset feature, but how can I write to a dataset with a different schema?
I seem to be able to import DirectoryPartitioning, for example, but I cannot figure out a way to write the data so that it produces a directory layout like this:
import pyarrow as pa
from pyarrow.dataset import DirectoryPartitioning
partitioning = DirectoryPartitioning(pa.schema([("year", pa.int16()), ("month", pa.int8()), ("day", pa.int8())]))
print(partitioning.parse("/2009/11/3"))
Will we continue to use write_to_dataset to write Parquet files, or will there be a new method specific to the datasets class?
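For what it's worth, recent pyarrow versions do expose a writer on the dataset API, pyarrow.dataset.write_dataset, which accepts the same partitioning objects used for reading. A minimal sketch, assuming an in-memory table and a local output directory (the table contents and the "out_dir" path are placeholders):
import pyarrow as pa
import pyarrow.dataset as ds

# Placeholder table; the partition columns are ordinary columns here.
table = pa.table({
    "year": pa.array([2009, 2009], pa.int16()),
    "month": pa.array([11, 11], pa.int8()),
    "day": pa.array([3, 4], pa.int8()),
    "value": [1.0, 2.0],
})

# Directory-style partitioning, producing paths like out_dir/2009/11/3/...
partitioning = ds.partitioning(
    pa.schema([("year", pa.int16()), ("month", pa.int8()), ("day", pa.int8())])
)

# write_dataset moves the partition columns into the directory structure and
# writes Parquet files underneath.
ds.write_dataset(table, "out_dir", format="parquet", partitioning=partitioning)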
I'm trying to use the Aeson JSON library in Haskell. Right now, I just need to use "decode" to read a JSON dump.
import Data.Aeson
import Data.ByteString as BS
import Control.Applicative
main :: IO ()
main = print $ decode <$> BS.readFile "json"
I got the following error when trying to compile/run it:
Couldn't match type 'ByteString'
              with 'Data.ByteString.Lazy.Internal.ByteString'
NB: 'ByteString' is defined in 'Data.ByteString.Internal'
    'Data.ByteString.Lazy.Internal.ByteString'
        is defined in 'Data.ByteString.Lazy.Internal'
This error doesn't make sense to me. I tried importing the modules mentioned by GHC, but the imports either fail or don't solve the problem.
Thanks
There are two variants of ByteString: a strict one (the default), exported by Data.ByteString, and a lazy one, exported by Data.ByteString.Lazy.
Aeson works on top of lazy ByteStrings, so you should change your second import line to
import Data.ByteString.Lazy as BS
I'm trying to export a pandas dataframe to JSON with no luck. I've tried:
all_data.to_json("spdata.json") and all_data.to_json()
I get the same attribute error on both: 'DataFrame' object has no attribute 'to_json'. Just to make sure nothing is wrong with the DataFrame itself, I tested writing it with to_csv and that worked.
Is there something I'm missing in my syntax, or a package I need to import? I am running Python 2.7.5, which is part of an Enthought Canopy Express package. The imports at the beginning of my code are:
from pandas import Series, DataFrame
import pandas as pd
import numpy as np
from sys import argv
from datetime import datetime, timedelta
from dateutil.parser import parse
The to_json method was introduced in pandas 0.12, so you'll need to upgrade your pandas installation to be able to use it.
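A quick way to check is to print the installed version; after upgrading, the call itself is straightforward. A small sketch with a throwaway DataFrame (the column names are made up):
import pandas as pd

print(pd.__version__)  # to_json requires pandas >= 0.12

df = pd.DataFrame({"ticker": ["AAPL", "MSFT"], "close": [100.0, 30.0]})
df.to_json("spdata.json")   # write JSON to a file
print(df.to_json())         # or get the JSON back as a string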