Commit 8f382b0

Merge branch 'master' of https://github.com/modin-project/modin into fix-testsw

2 parents b05cc60 + 0dfd88d

103 files changed: +4452 −1430 lines changed

.github/actions/mamba-env/action.yml

Lines changed: 4 additions & 0 deletions

@@ -42,3 +42,7 @@ runs:
           # we set use-only-tar-bz2 to false in order for conda to properly find new packages to be installed
           # for more info see https://github.com/conda-incubator/setup-miniconda/issues/264
           use-only-tar-bz2: false
+    - shell: bash -l {0}
+      run: |
+        conda run -n ${{ inputs.activate-environment }} pip install .
+        conda list -n ${{ inputs.activate-environment }}

.github/actions/run-core-tests/group_2/action.yml

Lines changed: 0 additions & 2 deletions

@@ -20,5 +20,3 @@ runs:
          modin/pandas/test/dataframe/test_pickle.py
        echo "::endgroup::"
      shell: bash -l {0}
-    - run: MODIN_RANGE_PARTITIONING=1 python -m pytest modin/pandas/test/dataframe/test_join_sort.py -k "merge"
-      shell: bash -l {0}

.github/actions/run-core-tests/group_3/action.yml

Lines changed: 6 additions & 1 deletion

@@ -18,7 +18,12 @@ runs:
        echo "::endgroup::"
      shell: bash -l {0}
    - run: |
-       echo "::group::Running experimental groupby tests (group 3)..."
+       echo "::group::Running range-partitioning tests (group 3)..."
        MODIN_RANGE_PARTITIONING_GROUPBY=1 ${{ inputs.runner }} ${{ inputs.parallel }} modin/pandas/test/test_groupby.py
+       MODIN_RANGE_PARTITIONING=1 ${{ inputs.runner }} ${{ inputs.parallel }} modin/pandas/test/test_series.py -k "test_unique or test_nunique or drop_duplicates or test_resample"
+       MODIN_RANGE_PARTITIONING=1 ${{ inputs.runner }} ${{ inputs.parallel }} modin/pandas/test/test_general.py -k "test_unique"
+       MODIN_RANGE_PARTITIONING=1 ${{ inputs.runner }} ${{ inputs.parallel }} modin/pandas/test/dataframe/test_map_metadata.py -k "drop_duplicates"
+       MODIN_RANGE_PARTITIONING=1 ${{ inputs.runner }} ${{ inputs.parallel }} modin/pandas/test/dataframe/test_join_sort.py -k "merge"
+       MODIN_RANGE_PARTITIONING=1 ${{ inputs.runner }} ${{ inputs.parallel }} modin/pandas/test/dataframe/test_default.py -k "resample"
        echo "::endgroup::"
      shell: bash -l {0}
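Each added line above sets `MODIN_RANGE_PARTITIONING=1` for a single pytest invocation only. As a standalone illustration of that shell pattern (no Modin involved), a one-command assignment is visible to the child process but does not persist afterwards:

```shell
# One-command environment assignment: the variable exists only for the
# child process, not for the rest of the script.
MODIN_RANGE_PARTITIONING=1 sh -c 'echo "inside: ${MODIN_RANGE_PARTITIONING}"'
echo "after: ${MODIN_RANGE_PARTITIONING:-unset}"
# prints "inside: 1" then "after: unset"
```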

.github/workflows/ci.yml

Lines changed: 0 additions & 1 deletion

@@ -188,7 +188,6 @@ jobs:
      - run: python -m pytest modin/pandas/test/dataframe/test_binary.py
      - run: python -m pytest modin/pandas/test/dataframe/test_reduce.py
      - run: python -m pytest modin/pandas/test/dataframe/test_join_sort.py
-     - run: MODIN_RANGE_PARTITIONING=1 python -m pytest modin/pandas/test/dataframe/test_join_sort.py -k "merge"
      - run: python -m pytest modin/pandas/test/test_general.py
      - run: python -m pytest modin/pandas/test/dataframe/test_indexing.py
      - run: python -m pytest modin/pandas/test/test_series.py

.github/workflows/push-to-master.yml

Lines changed: 0 additions & 1 deletion

@@ -46,7 +46,6 @@ jobs:
          python -m pytest modin/pandas/test/dataframe/test_indexing.py
          python -m pytest modin/pandas/test/dataframe/test_iter.py
          python -m pytest modin/pandas/test/dataframe/test_join_sort.py
-         MODIN_RANGE_PARTITIONING=1 python -m pytest modin/pandas/test/dataframe/test_join_sort.py -k "merge"
          python -m pytest modin/pandas/test/dataframe/test_map_metadata.py
          python -m pytest modin/pandas/test/dataframe/test_reduce.py
          python -m pytest modin/pandas/test/dataframe/test_udf.py

README.md

Lines changed: 2 additions & 0 deletions

@@ -24,6 +24,8 @@ Modin is a drop-in replacement for [pandas](https://github.com/pandas-dev/pandas
 single-threaded, Modin lets you instantly speed up your workflows by scaling pandas so it uses all of your
 cores. Modin works especially well on larger datasets, where pandas becomes painfully slow or runs
 [out of memory](https://modin.readthedocs.io/en/latest/getting_started/why_modin/out_of_core.html).
+Also, Modin comes with the [additional APIs](https://modin.readthedocs.io/en/latest/usage_guide/advanced_usage/index.html#additional-apis)
+to improve user experience.
 
 By simply replacing the import statement, Modin offers users effortless speed and scale for their pandas workflows:

asv_bench/benchmarks/utils/common.py

Lines changed: 13 additions & 45 deletions

@@ -114,11 +114,7 @@ def gen_nan_data(nrows: int, ncols: int) -> dict:
 
 def gen_int_data(nrows: int, ncols: int, rand_low: int, rand_high: int) -> dict:
     """
-    Generate int data with caching.
-
-    The generated data are saved in the dictionary and on a subsequent call,
-    if the keys match, saved data will be returned. Therefore, we need
-    to carefully monitor the changing of saved data and make its copy if needed.
+    Generate int data.
 
     Parameters
     ----------
@@ -136,30 +132,16 @@ def gen_int_data(nrows: int, ncols: int, rand_low: int, rand_high: int) -> dict:
     dict
         Number of keys - `ncols`, each of them store np.ndarray of `nrows` length.
     """
-    cache_key = ("int", nrows, ncols, rand_low, rand_high)
-    if cache_key in data_cache:
-        return data_cache[cache_key]
-
-    logging.info(
-        "Generating int data {} rows and {} columns [{}-{}]".format(
-            nrows, ncols, rand_low, rand_high
-        )
-    )
     data = {
         "col{}".format(i): np.random.randint(rand_low, rand_high, size=(nrows))
         for i in range(ncols)
     }
-    data_cache[cache_key] = weakdict(data)
     return data
 
 
 def gen_str_int_data(nrows: int, ncols: int, rand_low: int, rand_high: int) -> dict:
     """
-    Generate int data and string data with caching.
-
-    The generated data are saved in the dictionary and on a subsequent call,
-    if the keys match, saved data will be returned. Therefore, we need
-    to carefully monitor the changing of saved data and make its copy if needed.
+    Generate int data and string data.
 
     Parameters
     ----------
@@ -178,30 +160,16 @@ def gen_str_int_data(nrows: int, ncols: int, rand_low: int, rand_high: int) -> d
         Number of keys - `ncols`, each of them store np.ndarray of `nrows` length.
         One of the columns with string values.
     """
-    cache_key = ("str_int", nrows, ncols, rand_low, rand_high)
-    if cache_key in data_cache:
-        return data_cache[cache_key]
-
-    logging.info(
-        "Generating str_int data {} rows and {} columns [{}-{}]".format(
-            nrows, ncols, rand_low, rand_high
-        )
-    )
     data = gen_int_data(nrows, ncols, rand_low, rand_high).copy()
     # convert values in arbitary column to string type
     key = list(data.keys())[0]
     data[key] = [f"str_{x}" for x in data[key]]
-    data_cache[cache_key] = weakdict(data)
     return data
 
 
 def gen_true_false_int_data(nrows, ncols, rand_low, rand_high):
     """
-    Generate int data and string data "true" and "false" values with caching.
-
-    The generated data are saved in the dictionary and on a subsequent call,
-    if the keys match, saved data will be returned. Therefore, we need
-    to carefully monitor the changing of saved data and make its copy if needed.
+    Generate int data and string data "true" and "false" values.
 
     Parameters
     ----------
@@ -221,15 +189,6 @@ def gen_true_false_int_data(nrows, ncols, rand_low, rand_high):
         One half of the columns with integer values, another half - with "true" and
         "false" string values.
     """
-    cache_key = ("true_false_int", nrows, ncols, rand_low, rand_high)
-    if cache_key in data_cache:
-        return data_cache[cache_key]
-
-    logging.info(
-        "Generating true_false_int data {} rows and {} columns [{}-{}]".format(
-            nrows, ncols, rand_low, rand_high
-        )
-    )
     data = gen_int_data(nrows // 2, ncols // 2, rand_low, rand_high)
 
     data_true_false = {
@@ -239,7 +198,6 @@ def gen_true_false_int_data(nrows, ncols, rand_low, rand_high):
         for i in range(ncols - ncols // 2)
     }
     data.update(data_true_false)
-    data_cache[cache_key] = weakdict(data)
     return data
 
 
@@ -289,10 +247,20 @@ def gen_data(
         "str_int": gen_str_int_data,
         "true_false_int": gen_true_false_int_data,
     }
+    cache_key = (data_type, nrows, ncols, rand_low, rand_high)
+    if cache_key in data_cache:
+        return data_cache[cache_key]
+
+    logging.info(
+        "Generating {} data {} rows and {} columns [{}-{}]".format(
+            data_type, nrows, ncols, rand_low, rand_high
+        )
+    )
     assert data_type in type_to_generator
     data_generator = type_to_generator[data_type]
 
     data = data_generator(nrows, ncols, rand_low, rand_high)
+    data_cache[cache_key] = weakdict(data)
 
     return data
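The net effect of this refactor is that caching (and its log line) now lives only in the single `gen_data` entry point, while the per-type generators stay pure. A minimal sketch of the resulting shape, using simplified stand-in generators (the real ones build `np.ndarray`s and wrap the cached dict in Modin's `weakdict`):

```python
import logging

# Cache keyed by every argument that affects the generated data;
# in Modin's benchmark suite this lives at module level.
data_cache = {}


def gen_int_data(nrows, ncols, rand_low, rand_high):
    # Pure generator: no caching, no logging (stand-in for the numpy version).
    return {f"col{i}": list(range(rand_low, rand_low + nrows)) for i in range(ncols)}


def gen_str_int_data(nrows, ncols, rand_low, rand_high):
    # Builds on gen_int_data directly; the cache is never consulted here.
    data = gen_int_data(nrows, ncols, rand_low, rand_high).copy()
    key = next(iter(data))
    data[key] = [f"str_{x}" for x in data[key]]
    return data


TYPE_TO_GENERATOR = {"int": gen_int_data, "str_int": gen_str_int_data}


def gen_data(data_type, nrows, ncols, rand_low=0, rand_high=100):
    # Caching happens exactly once, at the dispatch point.
    cache_key = (data_type, nrows, ncols, rand_low, rand_high)
    if cache_key in data_cache:
        return data_cache[cache_key]
    logging.info("Generating %s data %s rows and %s columns", data_type, nrows, ncols)
    data = TYPE_TO_GENERATOR[data_type](nrows, ncols, rand_low, rand_high)
    data_cache[cache_key] = data  # the benchmark wraps this in weakdict(...)
    return data
```

Repeated calls with the same arguments now hit the cache exactly once, regardless of which generator does the work, which is why the per-generator `cache_key`/`data_cache` boilerplate could be deleted wholesale.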

docs/development/contributing.rst

Lines changed: 2 additions & 2 deletions

@@ -63,8 +63,8 @@ or ``--signoff`` to your usual ``git commit`` commands:
 
 .. code-block:: bash
 
-   git commit --signoff
-   git commit -s
+   git commit --signoff -m "This is my commit message"
+   git commit -s -m "This is my commit message"
 
 This will use your default git configuration which is found in .git/config. To change
 this, you can use the following commands:

docs/ecosystem.rst

Lines changed: 27 additions & 0 deletions

@@ -45,5 +45,32 @@ where NumPy can be used and what libraries it powers.
 
     numpy_arr = to_numpy(modin_df)
 
+to_ray
+------
+
+You can refer to `Ray Data`_ page to get more details on
+where Ray Dataset can be used and what libraries it powers.
+
+.. code-block:: python
+
+    from modin.pandas.io import to_ray
+
+    ray_dataset = to_ray(modin_df)
+
+to_dask
+-------
+
+You can refer to `Dask DataFrame`_ page to get more details on
+where Dask DataFrame can be used and what libraries it powers.
+
+.. code-block:: python
+
+    from modin.pandas.io import to_dask
+
+    dask_df = to_dask(modin_df)
+
 .. _pandas ecosystem: https://pandas.pydata.org/community/ecosystem.html
 .. _NumPy ecosystem: https://numpy.org
+.. _Ray Data: https://docs.ray.io/en/latest/data/data.html
+.. _Dask DataFrame: https://docs.dask.org/en/stable/dataframe.html

docs/flow/modin/config.rst

Lines changed: 27 additions & 0 deletions

@@ -56,3 +56,30 @@ API.
     # Changing value of `NPartitions`
     modin.config.NPartitions.put(16)
     print(modin.config.NPartitions.get()) # prints '16'
+
+One can also use config variables with a context manager in order to use
+some config only for a certain part of the code:
+
+.. code-block:: python
+
+    import modin.config as cfg
+
+    # Default value for this config is 'False'
+    print(cfg.RangePartitioning.get()) # False
+
+    # Set the config to 'True' inside of the context-manager
+    with cfg.context(RangePartitioning=True):
+        print(cfg.RangePartitioning.get()) # True
+        df.merge(...) # will use range-partitioning impl
+
+    # Once the context is over, the config gets back to its previous value
+    print(cfg.RangePartitioning.get()) # False
+
+    # You can also set multiple config at once when you pass a dictionary to 'cfg.context'
+    print(cfg.AsyncReadMode.get()) # False
+
+    with cfg.context(RangePartitioning=True, AsyncReadMode=True):
+        print(cfg.RangePartitioning.get()) # True
+        print(cfg.AsyncReadMode.get()) # True
+    print(cfg.RangePartitioning.get()) # False
+    print(cfg.AsyncReadMode.get()) # False
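The save-on-enter / restore-on-exit semantics documented above are not Modin-specific; they can be sketched with `contextlib`, using a plain dict as a hypothetical stand-in for `modin.config` variables:

```python
from contextlib import contextmanager

# Hypothetical in-memory config store standing in for modin.config.
_config = {"RangePartitioning": False, "AsyncReadMode": False}


def get(name):
    return _config[name]


@contextmanager
def context(**overrides):
    # Save the current values, apply the overrides, and restore the saved
    # values on exit -- even if the body raises.
    saved = {name: _config[name] for name in overrides}
    _config.update(overrides)
    try:
        yield
    finally:
        _config.update(saved)


print(get("RangePartitioning"))  # False
with context(RangePartitioning=True, AsyncReadMode=True):
    print(get("RangePartitioning"))  # True
    print(get("AsyncReadMode"))  # True
print(get("RangePartitioning"))  # False
```

The `try`/`finally` is the load-bearing piece: it guarantees the previous values come back even when the guarded block raises, which is what makes scoping a config to "a certain part of the code" safe.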
