Extending gpfdist in Cloudberry Database to Support SFTP Protocol for… #1226
Hi @ZTE-EBASE, thanks for your contribution. Some errors were returned, and the build process needs to be fixed.
In the Dependencies section of the PR, I have attached the dependency information. Please introduce the dependencies to build a new CI environment. The ssh2 library needs to be introduced during compilation and placed under /usr/local.
In file included from url_file.c:21:
../../../../src/include/fstream/gfile.h:29:10: fatal error: libssh2.h: No such file or directory
29 | #include <libssh2.h>
| ^~~~~~~~~~~
compilation terminated.
make[4]: *** [<builtin>: url_file.o] Error 1
Maybe we should update the dependency in the repo https://github.com/apache/cloudberry-devops-release
Thank you, I have updated it. Please proceed with the compilation in this repository.
How can we install and introduce the dependencies of
It could be done in the https://github.com/apache/cloudberry-devops-release repository. See how we get and compile Note that the work has to be done for all supported OSes (Rocky 8 & 9 currently).
Hey @ZTE-EBASE, would you like to squash your commits into one? Then I can guide you on this.
… Data Ingestion
gpfdist is a file distribution program in Cloudberry that can load external data into the database in parallel. However, it has the drawback that data files must reside on the same machine as the tool. Extending it to support the SFTP protocol addresses this drawback and enables loading files from a remote server.
Add the libssh2 library and specify the link.
Implement remote data file reading using the libssh2 library with gpfdist.
Extending gpfdist in Cloudberry Database to Support SFTP Protocol for Data Ingestion -- add LIBSSH2 macro
ADD LIBSSH2 macro
Label the SFTP-related code to indicate its characteristics.
c8be423
to
1119cf9
Compare
@@ -103,4 +127,11 @@ void gfile_printf_then_putc_newline(const char*format,...) pg_attribute_printf(1
void*gfile_malloc(size_t size);
void gfile_free(void*a);

#ifdef LIBSSH2
Where is LIBSSH2 defined?
Is it enabled by default or through a parameter option?
When configuring with --enable-gpfdist, we can check for libssh2 dynamically and, if it exists, define the LIBSSH2 macro.
It's a bit complicated, as I haven't really worked with this before.
refer to #1151
In file included from url_file.c:21:
../../../../src/include/fstream/gfile.h:75:9: error: unknown type name ‘LIBSSH2_SESSION’
75 | LIBSSH2_SESSION *session;
| ^~~~~~~~~~~~~~~
../../../../src/include/fstream/gfile.h:76:9: error: unknown type name ‘LIBSSH2_SFTP’
76 | LIBSSH2_SFTP *sftp_session;
| ^~~~~~~~~~~~
../../../../src/include/fstream/gfile.h:77:9: error: unknown type name ‘LIBSSH2_SFTP_HANDLE’
77 | LIBSSH2_SFTP_HANDLE *sftp_handle;
| ^~~~~~~~~~~~~~~~~~~
#ifdef LIBSSH2
#include <libssh2.h>
#include <libssh2_sftp.h>
#endif
It's clear that there is no dependency on the libssh2 library. There are header files, but that's it. As mentioned earlier, you said that you would check for the existence of LIBSSH2 through --enable-gpfdist. Does that mean it's up to me to control it?
Yes, it's your responsibility.
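One plausible reading of the errors above is that the libssh2 includes are guarded by LIBSSH2 while the struct members that use the libssh2 types are not, so a build without the macro still sees LIBSSH2_SESSION and friends. The sketch below shows the guard applied to both places. The struct `demo_gfile` and the helper `demo_gfile_has_sftp` are hypothetical stand-ins, not the actual contents of gfile.h.

```c
#include <assert.h>
#include <stddef.h>

/* LIBSSH2 is the macro the configure check in this PR defines when libssh2 is found. */
#ifdef LIBSSH2
#include <libssh2.h>
#include <libssh2_sftp.h>
#endif

/* Hypothetical, trimmed-down stand-in for the struct declared in gfile.h. */
typedef struct demo_gfile
{
    int fd;                      /* local file descriptor, always present */
#ifdef LIBSSH2
    /* These types exist only when the libssh2 headers were included above,
     * so the members need the same guard as the includes. */
    LIBSSH2_SESSION     *session;
    LIBSSH2_SFTP        *sftp_session;
    LIBSSH2_SFTP_HANDLE *sftp_handle;
#endif
} demo_gfile;

/* Returns 1 when this build carries the SFTP members, 0 otherwise. */
static int demo_gfile_has_sftp(void)
{
#ifdef LIBSSH2
    return 1;
#else
    return 0;
#endif
}
```

With this layout the header compiles whether or not the macro is defined, and any function that touches the SFTP members would need the same guard.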
Hi @ZTE-EBASE, here is some feedback on your commit message: To be updated:
Here is one sample based on your commit message for your reference:
gpfdist, Cloudberry's parallel file distribution program, traditionally required data files to be co-located with the gpfdist process. This limitation made it cumbersome to load data from remote servers, often requiring an extra data transfer step. This commit extends gpfdist to support the SFTP protocol, enabling users to ingest data directly from remote servers. This enhancement streamlines ETL workflows by allowing `CREATE EXTERNAL TABLE` to specify SFTP locations.
Key changes include: when configuring with --enable-gpfdist, libssh2 is checked for dynamically and, if it exists, the LIBSSH2 macro is defined.
configure.ac
Outdated
# Check libssh2 >= 1.0.0
PKG_CHECK_MODULES([LIBSSH2], [libssh2 >= 1.0.0],
[AC_DEFINE([LIBSSH2], [1], [Define if libssh2 is available])],
[AC_MSG_ERROR([libssh2 >= 1.0.0 is required for gpfdist support])]
PKG_CHECK_MODULES([LIBSSH2], [libssh2 >= 1.0.0],
[AC_DEFINE([LIBSSH2], [1], [Define if libssh2 is available])],
[AC_MSG_WARN([libssh2 >= 1.0.0 not found, gpfdist will build without libssh2 support])]
)
If libssh2 is not found, only a warning is issued (AC_MSG_WARN), the macro is not defined, and the configure process does not fail.
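Because configure now only warns, a binary built without libssh2 can still be asked to use SFTP at run time. A minimal sketch of how the C side could report that case clearly instead of failing obscurely is shown below. The enum `demo_transport` and the function `demo_transport_available` are hypothetical names for illustration and are not part of the PR.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical transport selector; not a type from gpfdist. */
typedef enum { DEMO_TRANSPORT_LOCAL, DEMO_TRANSPORT_SFTP } demo_transport;

/*
 * Returns 1 if the requested transport can be used in this build.
 * Returns 0 and writes an explanation into err when SFTP is requested
 * but LIBSSH2 was not defined at configure time.
 */
static int demo_transport_available(demo_transport t, char *err, size_t errlen)
{
    assert(err != NULL && errlen > 0);

    if (t == DEMO_TRANSPORT_LOCAL)
        return 1;               /* local files never need libssh2 */

#ifdef LIBSSH2
    return 1;                   /* SFTP code was compiled in */
#else
    snprintf(err, errlen,
             "SFTP support was not compiled in (libssh2 not found at configure time)");
    return 0;
#endif
}
```

The message text is only an example; the point is that the optional dependency surfaces as an explicit error rather than a missing code path.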
Okay, I'll make the changes. Thank you!
ingestion
gpfdist, Cloudberry's parallel file distribution program, traditionally required data files to be co-located with the gpfdist process. This limitation made it cumbersome to load data from remote servers, often requiring an extra data transfer step. This commit extends gpfdist to support the SFTP protocol, enabling users to ingest data directly from remote servers. This enhancement streamlines ETL workflows by allowing `CREATE EXTERNAL TABLE` to specify SFTP locations.
Key changes include: Adding the `LIBSSH2` preprocessor macro to conditionally compile the new SFTP-related code. When configuring with --enable-gpfdist, libssh2 is checked for dynamically; if it exists, the LIBSSH2 macro is defined, otherwise a warning is printed.
gpfdist, Cloudberry's parallel file distribution program, traditionally required data files to be co-located with the gpfdist process. This limitation made it cumbersome to load data from remote servers, often requiring an extra data transfer step. This commit extends gpfdist to support the SFTP protocol, enabling users to ingest data directly from remote servers. This enhancement streamlines ETL workflows by allowing `CREATE EXTERNAL TABLE` to specify SFTP locations.
Key changes include: Fix the compilation errors related to the libssh2 library option in the configure file.
libssh2-devel is introduced as a dependency package for the new feature in the PR apache/cloudberry#1226.
Hey, when testing this PR, some errors were returned: Build Env
Then
Then run (following this guide):
Then, errors were returned:
When running
The dependency
The issue has been identified and resolved, but please review it.
Thanks for your fix. Now, it has been tested and can be built successfully!
Feature: Supporting the Loading of bz Format Files
gpfdist, Cloudberry's parallel file distribution program, traditionally required data files to be co-located with the gpfdist process. This limitation made it cumbersome to load data from remote servers, often requiring an extra data transfer step. This commit extends gpfdist to support the SFTP protocol, enabling users to ingest data directly from remote servers. This enhancement streamlines ETL workflows by allowing `CREATE EXTERNAL TABLE` to specify SFTP locations.
Key change information: Implement the loading of .bz2 files by utilizing the read functions provided by the libssh2 library.
Add SFTP support to gpfdist for data ingestion
Feature: Supporting the Loading of bz(bz2) Format Files
gpfdist, Cloudberry's parallel file distribution program, traditionally required data files to be co-located with the gpfdist process. This limitation made it cumbersome to load data from remote servers, often requiring an extra data transfer step. This commit extends gpfdist to support the SFTP protocol, enabling users to ingest data directly from remote servers. This enhancement streamlines ETL workflows by allowing `CREATE EXTERNAL TABLE` to specify SFTP locations.
Key change information: Implement the loading of .bz2 files by utilizing the read functions provided by the libssh2 library.
5ad1ced
to
3193c63
Compare
Feature: Supporting the Loading of gz Format Files
gpfdist, Cloudberry's parallel file distribution program, traditionally required data files to be co-located with the gpfdist process. This limitation made it cumbersome to load data from remote servers, often requiring an extra data transfer step. This commit extends gpfdist to support the SFTP protocol, enabling users to ingest data directly from remote servers. This enhancement streamlines ETL workflows by allowing `CREATE EXTERNAL TABLE` to specify SFTP locations.
Key change information: Implement the loading of gz files by utilizing the read functions provided by the libssh2 library.
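Bytes read over SFTP are still compressed, so the reader has to know which decompressor to hand them to. The sketch below identifies gzip and bzip2 streams from their leading magic bytes (0x1f 0x8b for gzip, "BZh" for bzip2). This is an illustration only: the commits do not say how the format is chosen, gpfdist may well decide from the file name instead, and `demo_compression` and `demo_sniff_compression` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical classification of a data stream. */
typedef enum { DEMO_PLAIN, DEMO_GZIP, DEMO_BZIP2 } demo_compression;

/*
 * Classify a stream from its first bytes.
 * gzip streams start with 0x1f 0x8b; bzip2 streams start with "BZh".
 * Anything else, including a buffer that is too short, is treated as plain.
 */
static demo_compression demo_sniff_compression(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);

    if (len >= 2 && buf[0] == 0x1f && buf[1] == 0x8b)
        return DEMO_GZIP;
    if (len >= 3 && buf[0] == 'B' && buf[1] == 'Z' && buf[2] == 'h')
        return DEMO_BZIP2;
    return DEMO_PLAIN;
}
```

A caller would pass the first chunk returned by the SFTP read before choosing a decompression path.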
Feature: Support for SFTP server data access with IPv6 addresses.
gpfdist, Cloudberry's parallel file distribution program, traditionally required data files to be co-located with the gpfdist process. This limitation made it cumbersome to load data from remote servers, often requiring an extra data transfer step. This commit extends gpfdist to support the SFTP protocol, enabling users to ingest data directly from remote servers. This enhancement streamlines ETL workflows by allowing `CREATE EXTERNAL TABLE` to specify SFTP locations.
Key change information: Support for SFTP server data access based on address type, including both IPv4 and IPv6 addresses.
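Selecting behaviour by address type can be done with the standard getaddrinfo call. The sketch below, which assumes a POSIX system, asks getaddrinfo to parse a numeric host string and reports whether it is IPv4 or IPv6. The function name `demo_addr_family` is hypothetical, and the commit may detect the address type differently.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netdb.h>

/*
 * Returns AF_INET or AF_INET6 for a numeric IP literal,
 * or -1 if host is not a numeric address.
 */
static int demo_addr_family(const char *host)
{
    struct addrinfo hints;
    struct addrinfo *res = NULL;
    int family;

    assert(host != NULL);

    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_UNSPEC;        /* accept IPv4 and IPv6 */
    hints.ai_socktype = SOCK_STREAM;    /* SFTP runs over TCP */
    hints.ai_flags = AI_NUMERICHOST;    /* parse only, no DNS lookup */

    if (getaddrinfo(host, NULL, &hints, &res) != 0)
        return -1;

    family = res->ai_family;
    freeaddrinfo(res);
    return family;
}
```

Dropping AI_NUMERICHOST would let the same call resolve host names as well, and the returned addrinfo list can be passed to socket() and connect() without separate IPv4 and IPv6 branches.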
Feature: Support for writing Cloudberry table data to a remote SFTP server to achieve backup functionality.
gpfdist, Cloudberry's parallel file distribution program, traditionally required data files to be co-located with the gpfdist process. This limitation made it cumbersome to load data from remote servers, often requiring an extra data transfer step. This commit extends gpfdist to support the SFTP protocol, enabling users to ingest data directly from remote servers. This enhancement streamlines ETL workflows by allowing `CREATE EXTERNAL TABLE` to specify SFTP locations.
Key change information: Implement the `sftp_write` function to write Cloudberry table data to a remote SFTP server, thereby achieving backup functionality.
Implement the log rotation feature for gpfdist.
gpfdist, Cloudberry's parallel file distribution program, traditionally required data files to be co-located with the gpfdist process. This limitation made it cumbersome to load data from remote servers, often requiring an extra data transfer step. This commit addresses the issue of gpfdist logs continuously growing and occupying a large amount of disk space in a persistent working scenario. To avoid uncontrolled growth, the log rotation feature is implemented. The characteristics are as follows:
1) Logs will be rotated when their size exceeds 512MB;
2) Only two logs are kept in the log set: the latest current log and the previous rotated log.
Implement the log rotation feature for gpfdist.
gpfdist, Cloudberry's parallel file distribution program, traditionally required data files to be co-located with the gpfdist process. This limitation made it cumbersome to load data from remote servers, often requiring an extra data transfer step. This commit addresses the issue of gpfdist logs continuously growing and occupying a large amount of disk space in a persistent working scenario. To avoid uncontrolled growth, the log rotation feature is implemented. The characteristics are as follows:
Define the macro for log size as MAX_GPFDIST_LOGSIZE=512MB
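The two rules in the commit message (rotate above 512MB, keep only the current and the previous log) can be sketched with stat and rename on a POSIX system. MAX_GPFDIST_LOGSIZE is the macro named in the commit; the function `demo_rotate_log` and the ".1" suffix for the previous log are assumptions made for this example, not the PR's actual naming.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>

/* Macro name taken from the commit message; 512MB expressed in bytes. */
#define MAX_GPFDIST_LOGSIZE (512LL * 1024LL * 1024LL)

/*
 * Rotate path to "<path>.1" when it is larger than max_size bytes.
 * Returns 1 if a rotation happened, 0 if none was needed, -1 on error.
 * rename() replaces an existing "<path>.1" on POSIX systems, so at most
 * two files exist afterwards: the new current log and the previous one.
 */
static int demo_rotate_log(const char *path, long long max_size)
{
    struct stat st;
    char previous[4096];
    int n;

    assert(path != NULL);

    if (stat(path, &st) != 0)
        return 0;                       /* no log file yet */
    if ((long long) st.st_size <= max_size)
        return 0;                       /* still within the limit */

    n = snprintf(previous, sizeof(previous), "%s.1", path);
    if (n < 0 || (size_t) n >= sizeof(previous))
        return -1;                      /* path too long */

    if (rename(path, previous) != 0)
        return -1;
    return 1;
}
```

A caller would invoke `demo_rotate_log(log_path, MAX_GPFDIST_LOGSIZE)` before writing and reopen the log file whenever it returns 1, since the old file handle would otherwise keep writing to the renamed file.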
… Data Ingestion
gpfdist is a file distribution program in Cloudberry that can load external data into the database in parallel. However, it has the drawback that data files must reside on the same machine as the tool. Extending it to support the SFTP protocol addresses this drawback and enables loading files from a remote server.
Fixes #ISSUE_Number
What does this PR do?
By extending the gpfdist tool to support the SFTP protocol, remote data loading has been achieved, overcoming the challenge of having the tool and data files on the same machine.
Type of Change
New feature (non-breaking change)
Test Plan
make installcheck
make -C src/test installcheck-cbdb-parallel
Impact
Performance:
User-facing changes:
Dependencies:
The ssh2 library needs to be introduced during compilation and placed under /usr/local.
Checklist
Additional Context
Under this approach, the location template for the external table is:
Related Test Case:
1 Start gpfdist
2 create table (external)
3 data load
4 result
cat test.csv
1|ZTE-EBASE
2|ZTE-EBASE
3|ZTE-EBASE
4|ZTE-EBASE
5|ZTE-EBASE
6|ZTE-EBASE
7|ZTE-EBASE
8|ZTE-EBASE
9|ZTE-EBASE
10|ZTE-EBASE
The amount and content of the table data are consistent with the file.
CI Skip Instructions