Type Casting Error in pandasGenerateText with numpy.clip #314

@DevendraTomar

Description

We encountered an issue while generating data in Databricks using the following schema configuration:

[
  {"column_name": "id", "data_type": "int", "options": {"minValue": 1, "step": 1, "random": false, "percentNulls": 0}},
  {"column_name": "name", "data_type": "string", "options": {"words": [2, 10], "random": true, "percentNulls": 0}},
  {"column_name": "dob", "data_type": "date", "options": {"begin": "1990-02-20", "end": "2023-01-10", "random": false, "percentNulls": 0}}
]

An exception was raised in the Python worker while calling the pandasGenerateText method in dbldatagen (version 0.4.0). The error occurs when the numpy.clip function attempts to cast output from float64 to uint8 using the same_kind casting rule, leading to a UFuncOutputCastingError.

Stack Trace:

Traceback (most recent call last):
  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/dbldatagen/text_generators.py", line 881, in pandasGenerateText
    results = self.generateText(rows, rows.size)
  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/dbldatagen/text_generators.py", line 768, in generateText
    para_stats = np.clip(para_stats_raw, self._minValues, self._maxValues, out=stats_array)
  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/numpy/core/fromnumeric.py", line 2169, in clip
    return _wrapfunc(a, 'clip', a_min, a_max, out=out, **kwargs)
  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/numpy/core/fromnumeric.py", line 68, in _wrapfunc
    return _wrapit(obj, method, *args, **kwds)
  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/numpy/core/fromnumeric.py", line 45, in _wrapit
    result = getattr(asarray(obj), method)(*args, **kwds)
  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/numpy/core/_methods.py", line 99, in _clip
    return um.clip(a, min, max, out=out, **kwargs)
numpy.core._exceptions._UFuncOutputCastingError: Cannot cast ufunc 'clip' output from dtype('float64') to dtype('uint8') with casting rule 'same_kind'

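Expected behavior: the np.clip call in generateText should either write into an output array whose dtype is compatible with the clipped values, or cast to the target type explicitly, so that the float64 to uint8 conflict under the same_kind rule does not occur.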
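
A possible direction for a fix, sketched with plain NumPy. This is not the library's actual code: the helper name clip_to_uint8 and the sample values are hypothetical, and whether truncation to uint8 is the intended behaviour in generateText is an assumption.

```python
import numpy as np

def clip_to_uint8(raw, lo, hi):
    """Clip in the input dtype first, then cast to uint8 explicitly."""
    clipped = np.clip(raw, lo, hi)      # result stays float64 here
    return clipped.astype(np.uint8)     # explicit cast, fractions are truncated

# Hypothetical sample values standing in for the raw statistics.
raw = np.array([1.5, 300.0, -2.0], dtype=np.float64)
result = clip_to_uint8(raw, 0, 255)

# Alternative: keep the preallocated output buffer, but relax the casting rule.
out = np.empty(raw.shape, dtype=np.uint8)
np.clip(raw, 0, 255, out=out, casting="unsafe")

print(result, out)
```

The first variant makes the conversion visible at the call site; the second keeps the existing out= pattern but silences the check, so it is only reasonable when the clip bounds already fit the target dtype.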

Environment
Databricks: 14.3 LTS (Apache Spark 3.5.0, Scala 2.12)
Python: 3.10
dbldatagen: 0.4.0
numpy: 1.26.4
