@ZTE-EBASE ZTE-EBASE commented Jul 12, 2025

… Data Ingestion

gpfdist is a file distribution program in Cloudberry that loads external data into the database in parallel. However, it has the drawback that data files must reside on the same machine as the tool. Extending it to support the SFTP protocol addresses this drawback and enables loading files from a remote server.

Fixes #ISSUE_Number

What does this PR do?

This PR extends the gpfdist tool to support the SFTP protocol so that it can load data files from a remote server, removing the requirement that the tool and the data files reside on the same machine.

Type of Change

New feature (non-breaking change)

Test Plan

  • Unit tests added/updated
  • Integration tests added/updated
  • Passed make installcheck
  • Passed make -C src/test installcheck-cbdb-parallel

Impact

Performance:

User-facing changes:

Dependencies:
The libssh2 library must be available at compile time and installed under /usr/local.

Checklist

Additional Context

Under this approach, the location template for the external table is:

CREATE EXTERNAL TABLE ext1 (d varchar(20)) location ('gpfdist://ip:port/<sftp://sftp-user:passwd@sftp-hostip:sftp-port/file.csv>') format 'csv' (DELIMITER '|');
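To make the location template concrete, here is a minimal sketch of how the embedded SFTP URL could be split into its parts. It is not the parser used in this PR: the names `sftp_target` and `parse_sftp_url` are hypothetical, the default port of 22 is an assumption for URLs that omit the port, and the sketch does not handle bracketed IPv6 hosts or passwords containing `@` or `:`.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical holder for the pieces of an embedded SFTP URL. */
struct sftp_target {
    char user[64];
    char passwd[64];
    char host[128];
    int  port;
    char path[256];   /* kept with its leading '/' */
};

/*
 * Split "sftp://user:passwd@host:port/path" into its parts.
 * Returns 0 on success, -1 if the string does not match that shape.
 */
int parse_sftp_url(const char *url, struct sftp_target *out)
{
    const char *prefix = "sftp://";
    const char *p, *at, *colon, *slash;
    size_t n;

    assert(url != NULL && out != NULL);
    if (strncmp(url, prefix, strlen(prefix)) != 0)
        return -1;
    p = url + strlen(prefix);

    /* Credentials: everything before '@', split at the first ':'. */
    at = strchr(p, '@');
    if (at == NULL)
        return -1;
    colon = memchr(p, ':', (size_t)(at - p));
    if (colon == NULL)
        return -1;

    n = (size_t)(colon - p);
    if (n == 0 || n >= sizeof(out->user))
        return -1;
    memcpy(out->user, p, n);
    out->user[n] = '\0';

    n = (size_t)(at - colon - 1);
    if (n >= sizeof(out->passwd))
        return -1;
    memcpy(out->passwd, colon + 1, n);
    out->passwd[n] = '\0';

    /* Host and optional port: everything up to the first '/'. */
    p = at + 1;
    slash = strchr(p, '/');
    if (slash == NULL)
        return -1;
    colon = memchr(p, ':', (size_t)(slash - p));
    if (colon != NULL) {
        char *end;
        long port = strtol(colon + 1, &end, 10);
        if (end != slash || port <= 0 || port > 65535)
            return -1;
        out->port = (int)port;
        n = (size_t)(colon - p);
    } else {
        out->port = 22;   /* assumption: fall back to the usual SSH port */
        n = (size_t)(slash - p);
    }
    if (n == 0 || n >= sizeof(out->host))
        return -1;
    memcpy(out->host, p, n);
    out->host[n] = '\0';

    /* Remote path: the rest of the string, including the terminator. */
    n = strlen(slash);
    if (n >= sizeof(out->path))
        return -1;
    memcpy(out->path, slash, n + 1);
    return 0;
}
```

Calling `parse_sftp_url("sftp://user:secret@10.0.0.5:2222/data/file.csv", &t)` would fill `t` with the user, password, host, port 2222, and the path `/data/file.csv`. Because the password travels inside the URL, any real implementation should avoid writing the full location to logs.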

Related Test Case:
1 Start gpfdist

[cdbberry@node196 ~]$ gpfdist -d /home/cdbberry/ -p 9876 -l gpfdist.log &
[1] 83161
[cdbberry@node196 ~]$ 2025-07-12 14:49:21 83161 INFO Before opening listening sockets - following listening sockets are available:
2025-07-12 14:49:21 83161 INFO IPV6 socket: [::]:9876
2025-07-12 14:49:21 83161 INFO IPV4 socket: 0.0.0.0:9876
2025-07-12 14:49:21 83161 INFO Trying to open listening socket:
2025-07-12 14:49:21 83161 INFO IPV6 socket: [::]:9876
2025-07-12 14:49:21 83161 INFO Opening listening socket succeeded
2025-07-12 14:49:21 83161 INFO Trying to open listening socket:
2025-07-12 14:49:21 83161 INFO IPV4 socket: 0.0.0.0:9876
2025-07-12 14:49:21 83161 INFO Opening listening socket succeeded
Serving HTTP on port 9876, directory /home/cdbberry

2 create table (external)

CREATE table test(
id int,
name varchar(20)
);

CREATE external table test_ext(
id int,
name varchar(20)
)
location 
('gpfdist://10.229.89.196:9876/<sftp://xxx:xxxx@xxx:22/xx.csv>')
format 'csv' (delimiter as '|' NULL as '' FILL MISSING FIELDS) SEGMENT REJECT LIMIT 2 ROWS;

3 data load

 insert into test select * from test_ext;

4 result

postgres=# insert into test select * from test_ext;
INSERT 0 10
postgres=# select * from test;
 id |   name    
----+-----------
  2 | ZTE-EBASE
  3 | ZTE-EBASE
  4 | ZTE-EBASE
  6 | ZTE-EBASE
  7 | ZTE-EBASE
  8 | ZTE-EBASE
  9 | ZTE-EBASE
 10 | ZTE-EBASE
  1 | ZTE-EBASE
  5 | ZTE-EBASE
(10 rows)

cat test.csv
1|ZTE-EBASE
2|ZTE-EBASE
3|ZTE-EBASE
4|ZTE-EBASE
5|ZTE-EBASE
6|ZTE-EBASE
7|ZTE-EBASE
8|ZTE-EBASE
9|ZTE-EBASE
10|ZTE-EBASE

The row count and contents of the table match the file.

CI Skip Instructions


Member

@tuhaihe tuhaihe left a comment

Hi @ZTE-EBASE, thanks for your contribution. The build returned some errors that need to be fixed.

@ZTE-EBASE
Author

I have listed the dependency information in the Dependencies section of the PR. Please add these dependencies to the CI build environment.

The ssh2 library needs to be introduced during compilation and placed under /usr/local.

In file included from url_file.c:21:
../../../../src/include/fstream/gfile.h:29:10: fatal error: libssh2.h: No such file or directory
   29 | #include <libssh2.h>
      |          ^~~~~~~~~~~
compilation terminated.
make[4]: *** [<builtin>: url_file.o] Error 1

@tuhaihe
Member

tuhaihe commented Jul 14, 2025

Maybe we can update the dependency in the https://github.com/apache/cloudberry-devops-release repository.

@ZTE-EBASE
Author

ZTE-EBASE commented Jul 14, 2025

Thank you, I have updated. Please proceed with the compilation in this repository.

@ZTE-EBASE
Author

How can we install and introduce the arrow_dataset, arrow, and parquet dependencies in a Cloudberry image? We currently compile them from source and copy them into the image. We are not sure how the Cloudberry community handles this, and we need these dependencies for a future feature. @tuhaihe @yjhjstz

@leborchuk
Contributor

This could be done in the https://github.com/apache/cloudberry-devops-release repository. See how we fetch and compile xerces: https://github.com/apache/cloudberry-devops-release/blob/main/images/docker/cbdb/build/rocky9/Dockerfile#L142

Note that work has to be done for all supported OSes (Rocky 8 & 9 currently).

@ZTE-EBASE ZTE-EBASE requested review from yjhjstz and tuhaihe July 15, 2025 08:11
@tuhaihe
Member

tuhaihe commented Jul 15, 2025

Hey @ZTE-EBASE would you like to squash your commits into one? Then I can guide you on this.

… Data Ingestion

gpfdist is a file distribution program in Cloudberry that can parallel
load external data into the database. However, it has the drawback
that data files must reside on the same machine as the tool.
Therefore, extending it to support the SFTP protocol can address the
above drawback and enable loading files from a remote server.

Add the libssh2 library and specify the link.

Implement remote data file reading using the libssh2 library with
gpfdist.

Extending gpfdist in Cloudberry Database to Support
SFTP Protocol for Data Ingestion -- add LIBSSH2 macro

ADD LIBSSH2 macro Label the SFTP-related code to indicate its
characteristics.
@@ -103,4 +127,11 @@ void gfile_printf_then_putc_newline(const char*format,...) pg_attribute_printf(1
void*gfile_malloc(size_t size);
void gfile_free(void*a);

#ifdef LIBSSH2
Member

Where is LIBSSH2 defined?

Author

Is it enabled by default or through a parameter option?

Member

When configure is run with --enable-gpfdist, we can check for libssh2 dynamically and define the LIBSSH2 macro if it exists.

Author

It's a bit complicated, as I haven't really worked with this before.

Member

refer to #1151

Author

In file included from url_file.c:21:
../../../../src/include/fstream/gfile.h:75:9: error: unknown type name ‘LIBSSH2_SESSION’
75 | LIBSSH2_SESSION *session;
| ^~~~~~~~~~~~~~~
../../../../src/include/fstream/gfile.h:76:9: error: unknown type name ‘LIBSSH2_SFTP’
76 | LIBSSH2_SFTP *sftp_session;
| ^~~~~~~~~~~~
../../../../src/include/fstream/gfile.h:77:9: error: unknown type name ‘LIBSSH2_SFTP_HANDLE’
77 | LIBSSH2_SFTP_HANDLE *sftp_handle;
| ^~~~~~~~~~~~~~~~~~~

#ifdef LIBSSH2
#include <libssh2.h>
#include <libssh2_sftp.h>
#endif

It's clear that the libssh2 dependency is not being picked up: the header includes are present, but that's all. Earlier you said that the existence of libssh2 would be checked through --enable-gpfdist. Does that mean it's up to me to implement that check?
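The "unknown type name" errors above are what a compiler reports when a struct declares libssh2 types while the headers that define them are compiled out. A minimal sketch of one way to keep the two consistent is to place the struct members under the same `LIBSSH2` guard as the includes. The names `gfile_sketch`, `gfile_sketch_init`, and `sftp_compiled_in` are hypothetical; only the `LIBSSH2` macro and the three member names come from this thread, and the real layout in gfile.h may differ.

```c
#include <assert.h>
#include <stddef.h>

#ifdef LIBSSH2
#include <libssh2.h>
#include <libssh2_sftp.h>
#endif

/* Hypothetical stand-in for the file handle struct in gfile.h. */
struct gfile_sketch {
    int fd;                          /* local file descriptor, always present */
#ifdef LIBSSH2
    /* Declared only when the libssh2 headers above were included. */
    LIBSSH2_SESSION     *session;
    LIBSSH2_SFTP        *sftp_session;
    LIBSSH2_SFTP_HANDLE *sftp_handle;
#endif
};

/* Report whether this build was compiled with the SFTP members. */
int sftp_compiled_in(void)
{
#ifdef LIBSSH2
    return 1;
#else
    return 0;
#endif
}

/* Initialize a handle; the SFTP members are touched only if they exist. */
void gfile_sketch_init(struct gfile_sketch *f)
{
    assert(f != NULL);
    f->fd = -1;
#ifdef LIBSSH2
    f->session = NULL;
    f->sftp_session = NULL;
    f->sftp_handle = NULL;
#endif
}
```

Built without `-DLIBSSH2`, this compiles with no libssh2 headers installed; built with the macro defined, it requires them. Every other use of the SFTP members would need the same guard.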

Member

Yes, it's your responsibility.

@tuhaihe
Member

tuhaihe commented Jul 16, 2025

Hi @ZTE-EBASE, here is some feedback on your commit message:

To be updated:

  • Title: Use "Apache Cloudberry" or "Cloudberry" instead of "Cloudberry Database".
  • Commit body: Rewrite it based on the original commit messages rather than simply concatenating them. If needed, list the key changes one by one for better readability.
  • Commit trailer: I believe you are committing on behalf of your organization, so you can list its public contact information (org name plus public contact point). We can also add the related mailing list thread or GitHub Discussion URL for more context.

Here is a sample based on your commit message for your reference:

Feature: Add SFTP support to gpfdist for data ingestion

gpfdist, Cloudberry's parallel file distribution program,
traditionally required data files to be co-located with the gpfdist
process. This limitation made it cumbersome to load data from remote
servers, often requiring an extra data transfer step.

This commit extends gpfdist to support the SFTP protocol, enabling
users to ingest data directly from remote servers. This enhancement
streamlines ETL workflows by allowing `CREATE EXTERNAL TABLE` to specify
SFTP locations.

Key changes include:
- Integrating the libssh2 library to handle SFTP communication.
- Implementing remote file reading capabilities within gpfdist.
- Adding the `LIBSSH2` preprocessor macro to conditionally compile the
  new SFTP-related code.

Authored-by: Your NAME <NAME@EXAMPLE.COM>
Co-authored-by: More NAME <NAME@EXAMPLE.COM>
on-behalf-of: @ZTE-EBASE <xxx@zte-ebase.com>

See: https://github.com/apache/cloudberry/discussions/1205

gpfdist, Cloudberry's parallel file distribution program,
traditionally required data files to be co-located with the gpfdist
process. This limitation made it cumbersome to load data from remote
servers, often requiring an extra data transfer step.

This commit extends gpfdist to support the SFTP protocol, enabling
users to ingest data directly from remote servers. This enhancement
streamlines ETL workflows by allowing `CREATE EXTERNAL TABLE` to specify
SFTP locations.

Key changes include:
When configure is run with --enable-gpfdist, dynamically check for
libssh2 and, if it exists, define the LIBSSH2 macro.
@ZTE-EBASE ZTE-EBASE requested a review from yjhjstz July 19, 2025 03:16
configure.ac Outdated
# Check libssh2 >= 1.0.0
PKG_CHECK_MODULES([LIBSSH2], [libssh2 >= 1.0.0],
[AC_DEFINE([LIBSSH2], [1], [Define if libssh2 is available])],
[AC_MSG_ERROR([libssh2 >= 1.0.0 is required for gpfdist support])]
Member

PKG_CHECK_MODULES([LIBSSH2], [libssh2 >= 1.0.0],
  [AC_DEFINE([LIBSSH2], [1], [Define if libssh2 is available])],
  [AC_MSG_WARN([libssh2 >= 1.0.0 not found, gpfdist will build without libssh2 support])]
)

If libssh2 is not found, only a warning is issued (AC_MSG_WARN), the macro is not defined, and the configure process does not fail.
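With the warning-only configure behavior described above, a build without libssh2 still has to do something sensible when a user supplies an SFTP location. The following is a minimal sketch of one possible runtime fallback, not the code in this PR: `open_status`, `is_sftp_location`, and `classify_open` are hypothetical names, and only the `LIBSSH2` macro comes from the thread.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical outcome of deciding how a location should be opened. */
enum open_status { OPEN_LOCAL = 0, OPEN_SFTP = 1, OPEN_UNSUPPORTED = -1 };

/* Return nonzero if the location starts with the sftp:// scheme. */
int is_sftp_location(const char *path)
{
    assert(path != NULL);
    return strncmp(path, "sftp://", 7) == 0;
}

/*
 * Decide how to open a location. When the build lacks libssh2, an SFTP
 * location is rejected with a message instead of failing at link time.
 */
enum open_status classify_open(const char *path, char *err, size_t errlen)
{
    assert(path != NULL && err != NULL && errlen > 0);
    err[0] = '\0';

    if (!is_sftp_location(path))
        return OPEN_LOCAL;

#ifdef LIBSSH2
    return OPEN_SFTP;
#else
    /* The location is not echoed because it embeds the SFTP password. */
    snprintf(err, errlen,
             "SFTP location requested, but this build has no libssh2 support");
    return OPEN_UNSUPPORTED;
#endif
}
```

The design choice here is to surface a clear error at request time, so that a user who built without libssh2 learns why the load failed instead of seeing a generic "file not found".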

Author

Okay, I'll make the changes. Thank you!

ingestion

gpfdist, Cloudberry's parallel file distribution program,
traditionally required data files to be co-located with the gpfdist
process. This limitation made it cumbersome to load data from remote
servers, often requiring an extra data transfer step.

This commit extends gpfdist to support the SFTP protocol, enabling
users to ingest data directly from remote servers. This enhancement
streamlines ETL workflows by allowing `CREATE EXTERNAL TABLE`
to specify SFTP locations.

Key changes include:
Add the `LIBSSH2` preprocessor macro to conditionally compile the
new SFTP-related code. When configure is run with --enable-gpfdist,
dynamically check for libssh2 and define the LIBSSH2 macro if it
exists; otherwise, print a warning.
@ZTE-EBASE ZTE-EBASE requested a review from yjhjstz July 19, 2025 08:46
gpfdist, Cloudberry's parallel file distribution program,
traditionally required data files to be co-located with the gpfdist
process. This limitation made it cumbersome to load data from remote
servers, often requiring an extra data transfer step.

This commit extends gpfdist to support the SFTP protocol, enabling
users to ingest data directly from remote servers. This enhancement
streamlines ETL workflows by allowing `CREATE EXTERNAL TABLE`
to specify SFTP locations.

Key changes include:
Fix the compilation errors related to the libssh2 library option
in the configure file.
tuhaihe added a commit to tuhaihe/cloudberry-devops-release that referenced this pull request Jul 22, 2025
libssh2-devel is introduced as a dependency package for the new feature
in the PR apache/cloudberry#1226.
@tuhaihe
Member

tuhaihe commented Jul 22, 2025

Hey, when testing this PR, some errors were returned:

Build Env

docker run --name cbdb-dev -it --rm -h cdw --shm-size=2gb apache/incubator-cloudberry:cbdb-build-rocky9-latest

Then

git clone --branch temp_cloudberry https://github.com/ZTE-EBASE/cloudberry.git
cd cloudberry/
sudo dnf install --enablerepo=epel libssh2-devel

Then run (following this guide):

sudo rm -rf /usr/local/cloudberry-db
sudo chmod a+w /usr/local
mkdir -p /usr/local/cloudberry-db/lib
sudo cp -v /usr/local/xerces-c/lib/libxerces-c.so \
           /usr/local/xerces-c/lib/libxerces-c-3.*.so \
           /usr/local/cloudberry-db/lib
sudo chown -R gpadmin:gpadmin /usr/local/cloudberry-db
export LD_LIBRARY_PATH=/usr/local/cloudberry-db/lib:$LD_LIBRARY_PATH
./configure --prefix=/usr/local/cloudberry-db \
            --disable-external-fts \
            --enable-debug \
            --enable-cassert \
            --enable-debug-extensions \
            --enable-gpcloud \
            --enable-ic-proxy \
            --enable-mapreduce \
            --enable-orafce \
            --enable-orca \
            --enable-pxf \
            --enable-tap-tests \
            --with-gssapi \
            --with-ldap \
            --with-libxml \
            --with-lz4 \
            --with-pam \
            --with-perl \
            --with-pgport=5432 \
            --with-python \
            --with-pythonsrc-ext \
            --with-ssl=openssl \
            --with-uuid=e2fs \
            --with-includes=/usr/local/xerces-c/include \
            --with-libraries=/usr/local/cloudberry-db/lib

make -j$(nproc) -C ~/cloudberry

Then, errors were returned:

gcc -Wall -Wmissing-prototypes -Wpointer-arith -Werror=vla -Wendif-labels -Wmissing-format-attribute -Wimplicit-fallthrough=3 -Wcast-function-type -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -Wno-unused-but-set-variable -Werror=implicit-fallthrough=3 -Wno-format-truncation -Wno-stringop-truncation -g -O3 -fPIC  -DUSE_INTERNAL_FTS=1  -Werror=uninitialized -Werror=implicit-function-declaration -Werror -I../../../src/interfaces/libpq -I../../../src/include   -D_GNU_SOURCE -I/usr/include/libxml2  -I/usr/local/xerces-c/include  -c -o nodeDynamicForeignscan.o nodeDynamicForeignscan.c
( echo src/backend/executor/execAmi.o src/backend/executor/execAsync.o src/backend/executor/execCurrent.o src/backend/executor/execExpr.o src/backend/executor/execExprInterp.o src/backend/executor/execGrouping.o src/backend/executor/execIndexing.o src/backend/executor/execJunk.o src/backend/executor/execMain.o src/backend/executor/execParallel.o src/backend/executor/execPartition.o src/backend/executor/execProcnode.o src/backend/executor/execReplication.o src/backend/executor/execSRF.o src/backend/executor/execScan.o src/backend/executor/execTuples.o src/backend/executor/execUtils.o src/backend/executor/functions.o src/backend/executor/instrument.o src/backend/executor/nodeAgg.o src/backend/executor/nodeAppend.o src/backend/executor/nodeBitmapAnd.o src/backend/executor/nodeBitmapHeapscan.o src/backend/executor/nodeBitmapIndexscan.o src/backend/executor/nodeBitmapOr.o src/backend/executor/nodeCtescan.o src/backend/executor/nodeCustom.o src/backend/executor/nodeForeignscan.o src/backend/executor/nodeFunctionscan.o src/backend/executor/nodeGather.o src/backend/executor/nodeGatherMerge.o src/backend/executor/nodeGroup.o src/backend/executor/nodeHash.o src/backend/executor/nodeHashjoin.o src/backend/executor/nodeIncrementalSort.o src/backend/executor/nodeIndexonlyscan.o src/backend/executor/nodeIndexscan.o src/backend/executor/nodeLimit.o src/backend/executor/nodeLockRows.o src/backend/executor/nodeMaterial.o src/backend/executor/nodeMemoize.o src/backend/executor/nodeMergeAppend.o src/backend/executor/nodeMergejoin.o src/backend/executor/nodeModifyTable.o src/backend/executor/nodeNamedtuplestorescan.o src/backend/executor/nodeNestloop.o src/backend/executor/nodeProjectSet.o src/backend/executor/nodeRecursiveunion.o src/backend/executor/nodeResult.o src/backend/executor/nodeRuntimeFilter.o src/backend/executor/nodeSamplescan.o src/backend/executor/nodeSeqscan.o src/backend/executor/nodeSetOp.o src/backend/executor/nodeSort.o src/backend/executor/nodeSubplan.o 
src/backend/executor/nodeSubqueryscan.o src/backend/executor/nodeTableFuncscan.o src/backend/executor/nodeTidrangescan.o src/backend/executor/nodeTidscan.o src/backend/executor/nodeUnique.o src/backend/executor/nodeValuesscan.o src/backend/executor/nodeWindowAgg.o src/backend/executor/nodeWorktablescan.o src/backend/executor/spi.o src/backend/executor/tqueue.o src/backend/executor/tstoreReceiver.o src/backend/executor/nodeMotion.o src/backend/executor/nodeShareInputScan.o src/backend/executor/nodeTableFunction.o src/backend/executor/nodeSequence.o src/backend/executor/nodeAssertOp.o src/backend/executor/nodeSplitUpdate.o src/backend/executor/nodeTupleSplit.o src/backend/executor/nodePartitionSelector.o src/backend/executor/execDynamicIndexes.o src/backend/executor/nodeDynamicSeqscan.o src/backend/executor/nodeDynamicIndexscan.o src/backend/executor/nodeDynamicIndexOnlyscan.o src/backend/executor/nodeDynamicBitmapHeapscan.o src/backend/executor/nodeDynamicBitmapIndexscan.o src/backend/executor/nodeDynamicForeignscan.o ) >objfiles.txt
make[3]: Leaving directory '/home/gpadmin/cloudberry/src/backend/executor'
make[2]: Leaving directory '/home/gpadmin/cloudberry/src/backend'
make[1]: *** [Makefile:45: all-backend-recurse] Error 2
make[1]: Leaving directory '/home/gpadmin/cloudberry/src'
make: *** [GNUmakefile:11: all-src-recurse] Error 2
make: Leaving directory '/home/gpadmin/cloudberry'

When running

make install -C ~/cloudberry
gcc -Wall -Wmissing-prototypes -Wpointer-arith -Werror=vla -Wendif-labels -Wmissing-format-attribute -Wimplicit-fallthrough=3 -Wcast-function-type -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -Wno-unused-but-set-variable -Werror=implicit-fallthrough=3 -Wno-format-truncation -Wno-stringop-truncation -g -O3 -fPIC  -DUSE_INTERNAL_FTS=1  -Werror=uninitialized -Werror=implicit-function-declaration -Werror -I../../../../src/include   -D_GNU_SOURCE -I/usr/include/libxml2  -I/usr/local/xerces-c/include  -c -o url_file.o url_file.c
In file included from url_file.c:21:
../../../../src/include/fstream/gfile.h:75:9: error: unknown type name ‘LIBSSH2_SESSION’
   75 |         LIBSSH2_SESSION *session;
      |         ^~~~~~~~~~~~~~~
../../../../src/include/fstream/gfile.h:76:9: error: unknown type name ‘LIBSSH2_SFTP’
   76 |         LIBSSH2_SFTP *sftp_session;
      |         ^~~~~~~~~~~~
../../../../src/include/fstream/gfile.h:77:9: error: unknown type name ‘LIBSSH2_SFTP_HANDLE’
   77 |         LIBSSH2_SFTP_HANDLE *sftp_handle;
      |         ^~~~~~~~~~~~~~~~~~~
make[4]: *** [<builtin>: url_file.o] Error 1
make[4]: Leaving directory '/home/gpadmin/cloudberry/src/backend/access/external'
make[3]: *** [../../../src/backend/common.mk:39: external-recursive] Error 2
make[3]: Leaving directory '/home/gpadmin/cloudberry/src/backend/access'
make[2]: *** [common.mk:39: access-recursive] Error 2
make[2]: Leaving directory '/home/gpadmin/cloudberry/src/backend'
make[1]: *** [Makefile:45: install-backend-recurse] Error 2
make[1]: Leaving directory '/home/gpadmin/cloudberry/src'
make: *** [GNUmakefile:11: install-src-recurse] Error 2
make: Leaving directory '/home/gpadmin/cloudberry'

@tuhaihe
Member

tuhaihe commented Jul 22, 2025

The dependency libssh2-devel will be installed in the dev image via apache/cloudberry-devops-release#27 once that PR is approved.

@ZTE-EBASE
Author

The issue has been identified and resolved, but please review it.

@tuhaihe
Member

tuhaihe commented Jul 23, 2025

Thanks for your fix. Now, it has been tested and can be built successfully!

王平10304955 added 2 commits July 26, 2025 11:07
gpfdist, Cloudberry's parallel file distribution program,
traditionally required data files to be co-located with the gpfdist
process. This limitation made it cumbersome to load data from remote
servers, often requiring an extra data transfer step.

This commit extends gpfdist to support the SFTP protocol, enabling
users to ingest data directly from remote servers. This enhancement
streamlines ETL workflows by allowing `CREATE EXTERNAL TABLE`
to specify SFTP locations.

Key changes include:
Fix the compilation errors related to the libssh2 library option
in the configure file.
Feature:Supporting the Loading of bz Format Files

gpfdist, Cloudberry's parallel file distribution program,
traditionally required data files to be co-located with the gpfdist
process. This limitation made it cumbersome to load data from remote
servers, often requiring an extra data transfer step.

This commit extends gpfdist to support the SFTP protocol, enabling
users to ingest data directly from remote servers. This enhancement
streamlines ETL workflows by allowing `CREATE EXTERNAL TABLE` to specify
SFTP locations.

Key change information:
Implement the loading of .bz2 files by utilizing the read functions
provided by the libssh2 library.

Add SFTP support to gpfdist for data ingestion
Feature:Supporting the Loading of bz(bz2) Format Files

gpfdist, Cloudberry's parallel file distribution program,
traditionally required data files to be co-located with the gpfdist
process. This limitation made it cumbersome to load data from remote
servers, often requiring an extra data transfer step.

This commit extends gpfdist to support the SFTP protocol, enabling
users to ingest data directly from remote servers. This enhancement
streamlines ETL workflows by allowing `CREATE EXTERNAL TABLE` to specify
SFTP locations.

Key change information:
Implement the loading of .bz2 files by utilizing the read functions
provided by the libssh2 library.
王平10304955 added 5 commits August 18, 2025 10:38
Feature:Supporting the Loading of gz Format Files

gpfdist, Cloudberry's parallel file distribution program,
traditionally required data files to be co-located with the gpfdist
process. This limitation made it cumbersome to load data from remote
servers, often requiring an extra data transfer step.

This commit extends gpfdist to support the SFTP protocol, enabling
users to ingest data directly from remote servers. This enhancement
streamlines ETL workflows by allowing `CREATE EXTERNAL TABLE` to specify
SFTP locations.

Key change information:
Implement the loading of gz files by utilizing the read functions
provided by the libssh2 library.
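The bz2 and gz commits say the compressed files are read through libssh2 but do not say how the format is chosen. One plausible approach, sketched here purely as an assumption, is to pick the decompressor from the remote file name suffix. The names `compression` and `compression_from_name` are hypothetical and are not taken from the PR.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical tags for the formats mentioned in the commit messages. */
enum compression { COMP_NONE, COMP_GZ, COMP_BZ2 };

/* Return nonzero if s ends with suffix. */
static int has_suffix(const char *s, const char *suffix)
{
    size_t ls = strlen(s);
    size_t lx = strlen(suffix);

    return ls >= lx && strcmp(s + ls - lx, suffix) == 0;
}

/* Choose a decompressor from the file name; plain files fall through. */
enum compression compression_from_name(const char *path)
{
    assert(path != NULL);
    if (has_suffix(path, ".gz"))
        return COMP_GZ;
    if (has_suffix(path, ".bz2"))
        return COMP_BZ2;
    return COMP_NONE;
}
```

Under this assumption, `data.csv.gz` maps to `COMP_GZ`, `data.csv.bz2` to `COMP_BZ2`, and `data.csv` to `COMP_NONE`. The bytes fetched over SFTP would then be fed to the matching decompressor instead of being parsed directly.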
Feature:Support for SFTP server data access
with IPv6 addresses.

gpfdist, Cloudberry's parallel file distribution program,
traditionally required data files to be co-located with the gpfdist
process. This limitation made it cumbersome to load data from remote
servers, often requiring an extra data transfer step.

This commit extends gpfdist to support the SFTP protocol, enabling
users to ingest data directly from remote servers. This enhancement
streamlines ETL workflows by allowing `CREATE EXTERNAL TABLE` to specify
SFTP locations.
Key change information:
Support for SFTP server data access based on address type,
including both IPv4 and IPv6 addresses.
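Selecting behavior "based on address type" can be sketched with the POSIX `inet_pton` function, which parses a literal address for a given family. This is an illustration under assumptions, not the PR's code: `host_address_family` is a hypothetical name, and the sketch expects a bare address such as `::1`, not the bracketed `[::1]` form used in URLs.

```c
#include <assert.h>
#include <stddef.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

/*
 * Classify a literal host address.
 * Returns AF_INET for IPv4, AF_INET6 for IPv6, or AF_UNSPEC when the
 * string is not a numeric address (for example a hostname).
 */
int host_address_family(const char *host)
{
    struct in_addr  v4;
    struct in6_addr v6;

    assert(host != NULL);
    if (inet_pton(AF_INET, host, &v4) == 1)
        return AF_INET;
    if (inet_pton(AF_INET6, host, &v6) == 1)
        return AF_INET6;
    return AF_UNSPEC;
}
```

The returned family can be passed to `socket()` when opening the TCP connection that the SSH session runs over. A hostname, which yields `AF_UNSPEC` here, would need name resolution (for example with `getaddrinfo`) before connecting.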
Feature:Support for writing CloudBerry table data to a
remote SFTP server to achieve backup functionality.

gpfdist, Cloudberry's parallel file distribution program,
traditionally required data files to be co-located with the gpfdist
process. This limitation made it cumbersome to load data from remote
servers, often requiring an extra data transfer step.

This commit extends gpfdist to support the SFTP protocol, enabling
users to ingest data directly from remote servers. This enhancement
streamlines ETL workflows by allowing `CREATE EXTERNAL TABLE` to specify
SFTP locations.

Key change information:
Implement the `sftp_write` function to write CloudBerry table data
to a remote SFTP server, thereby achieving backup functionality.
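A write path over SFTP has to cope with short writes: `libssh2_sftp_write` returns the number of bytes actually written, which can be fewer than requested, or a negative value on error. The sketch below shows the retry loop in isolation, using a callback so it runs without libssh2. It is not the PR's `sftp_write`: `write_fn`, `write_all`, and the in-memory `mem_sink` stand-in are hypothetical names, and a blocking session is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>   /* ssize_t */

/*
 * Hypothetical writer callback: returns the number of bytes accepted
 * (possibly fewer than len) or a negative value on error. In a real
 * build it would wrap libssh2_sftp_write().
 */
typedef ssize_t (*write_fn)(void *ctx, const char *buf, size_t len);

/* Keep calling fn until all len bytes are written. 0 on success, -1 on error. */
int write_all(write_fn fn, void *ctx, const char *buf, size_t len)
{
    size_t done = 0;

    assert(fn != NULL && (buf != NULL || len == 0));
    while (done < len) {
        ssize_t n = fn(ctx, buf + done, len - done);

        if (n <= 0)
            return -1;   /* error, or no progress: stop instead of spinning */
        done += (size_t)n;
    }
    return 0;
}

/* In-memory stand-in for a remote file, used only to exercise the loop. */
struct mem_sink {
    char   data[64];
    size_t used;
};

/* Accepts at most 4 bytes per call to imitate short writes. */
ssize_t mem_sink_write(void *ctx, const char *buf, size_t len)
{
    struct mem_sink *s = ctx;
    size_t room = sizeof(s->data) - s->used;
    size_t n = len < 4 ? len : 4;

    if (room == 0)
        return -1;
    if (n > room)
        n = room;
    memcpy(s->data + s->used, buf, n);
    s->used += n;
    return (ssize_t)n;
}
```

Writing the 12-byte row `"1|ZTE-EBASE\n"` through `write_all(mem_sink_write, &sink, row, 12)` takes three callback invocations and leaves the full row in the sink. In non-blocking mode libssh2 can also return `LIBSSH2_ERROR_EAGAIN`, which this sketch would treat as a failure.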
Implement the log rotation feature for gpfdist.

gpfdist, Cloudberry's parallel file distribution program,
traditionally required data files to be co-located with the gpfdist
process. This limitation made it cumbersome to load data from remote
servers, often requiring an extra data transfer step.
This commit addresses the issue of gpfdist logs continuously growing
and occupying a large amount of disk space in a persistent working
scenario. To avoid uncontrolled growth, the log rotation feature is
implemented. The characteristics are as follows:
1) Logs will be rotated when their size exceeds 512MB;
2) Only two logs are kept in the log set, one is the latest
current log, and the other is the previous rotated log.
Implement the log rotation feature for gpfdist.

gpfdist, Cloudberry's parallel file distribution program,
traditionally required data files to be co-located with the gpfdist
process. This limitation made it cumbersome to load data from remote
servers, often requiring an extra data transfer step.
This commit addresses the issue of gpfdist logs continuously growing
and occupying a large amount of disk space in a persistent working
scenario. To avoid uncontrolled growth, the log rotation feature is
implemented. The characteristics are as follows:

Define the macro for log size as MAX_GPFDIST_LOGSIZE=512MB
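The rotation policy described in these commits (rotate once the log exceeds 512MB, keep only the current log and one previous log) can be sketched with `stat` and `rename`. This is a minimal illustration, not the PR's implementation: `rotate_log_if_needed` and the `.1` suffix are assumptions, while `MAX_GPFDIST_LOGSIZE` and its 512MB value come from the commit message.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

/* 512MB threshold, as named in the commit message. */
#define MAX_GPFDIST_LOGSIZE (512L * 1024L * 1024L)

/*
 * Rotate `path` to `path.1` when it has grown past max_size bytes.
 * Returns 1 if rotated, 0 if no rotation was needed, -1 on error.
 * The caller is expected to reopen the log file after a rotation.
 */
int rotate_log_if_needed(const char *path, off_t max_size)
{
    struct stat st;
    char rotated[4096];
    int n;

    assert(path != NULL);
    if (stat(path, &st) != 0)
        return -1;
    if (st.st_size <= max_size)
        return 0;

    n = snprintf(rotated, sizeof rotated, "%s.1", path);
    if (n < 0 || (size_t)n >= sizeof rotated)
        return -1;

    /* rename() replaces an existing path.1, so at most two logs remain. */
    if (rename(path, rotated) != 0)
        return -1;
    return 1;
}
```

A long-running gpfdist could call `rotate_log_if_needed(log_path, MAX_GPFDIST_LOGSIZE)` before each write or on a timer. Because `rename` overwrites the previous `.1` file, no separate cleanup step is needed to cap the set at two files.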