trustgraph/trustgraph-core/scripts/concat-parquet
cybermaggedon f081933217
Feature/subpackages (#80)
* Renaming what will become the core package

* Tweaking to get package build working

* Fix metering merge

* Rename to core directory

* Bump version.  Use namespace searching for packaging trustgraph-core

* Change references to trustgraph-core

* Forming embeddings-hf package

* Reference modules in core package.

* Build both packages to one container, bump version

* Update YAMLs
2024-09-30 14:00:29 +01:00


#!/usr/bin/env python3
"""
Concatenates multiple parquet files into a single parquet output
"""

import argparse

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

parser = argparse.ArgumentParser(
    prog="concat-parquet",
    description=__doc__
)

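# At least one input file is required; without this, omitting -i leaves
# args.input as None and the read loop fails with a TypeError.
parser.add_argument(
    '-i', '--input',
    nargs='+',
    required=True,
    help='Input parquet files'
)
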
parser.add_argument(
    '-o', '--output',
    required=True,
    help='Output parquet file'
)

args = parser.parse_args()

# Accumulate all input files into a single DataFrame
df = None

for file in args.input or []:
    part = pq.read_table(file).to_pandas()
    if df is None:
        df = part
    else:
        df = pd.concat([df, part], ignore_index=True)

# Write the output only if at least one input file was read
if df is not None:
    table = pa.Table.from_pandas(df)
    pq.write_table(table, args.output)