New transformer for license and copyright header removal #332

ykalathiya · 2024-06-25T06:19:20Z

Why are these changes needed?

This adds a new transform module that removes license and copyright headers from the input code data. The transform depends on [scancode-toolkit](https://pypi.org/project/scancode-toolkit).

Related issue number (if any).

Closes: #63

daw3rd

You need a license_copyright_remove/README.md

Can we find a shorter name? Maybe legal_removal, common_removal?

transforms/code/license_copyright_removal/ray/requirements.txt

transforms/code/license_copyright_removal/python/requirements.txt

transforms/code/license_copyright_removal/python/src/license_copyright_removal_transform.py

transforms/code/license_copyright_removal/ray/Dockerfile

transforms/code/license_copyright_removal/ray/Makefile

transforms/code/license_copyright_removal/ray/README.md

transforms/code/legal_removal/python/Makefile

transforms/code/legal_removal/ray/Makefile

transforms/code/legal_removal/ray/README.md

daw3rd · 2024-06-25T15:32:24Z

> You need a license_copyright_remove/README.md
>
> Can we find a shorter name? Maybe legal_removal, common_removal?

Excellent! Thanks.

Bytes-Explorer · 2024-06-25T15:35:05Z

> You need a license_copyright_remove/README.md
>
> Can we find a shorter name? Maybe legal_removal, common_removal?

@ykalathiya Pls suggest a name that calls out the functionality that this module will do.

ykalathiya · 2024-06-25T15:41:56Z

> You need a license_copyright_remove/README.md
> Can we find a shorter name? Maybe legal_removal, common_removal?
>
> @ykalathiya Pls suggest a name that calls out the functionality that this module will do.

legal_clean or code_clean?

ykalathiya · 2024-06-25T15:42:58Z

> You need a license_copyright_remove/README.md
> Can we find a shorter name? Maybe legal_removal, common_removal?
>
> @ykalathiya Pls suggest a name that calls out the functionality that this module will do.

Or this legal_remove is good?

Bytes-Explorer · 2024-06-25T15:58:01Z

> You need a license_copyright_remove/README.md
> Can we find a shorter name? Maybe legal_removal, common_removal?
>
> @ykalathiya Pls suggest a name that calls out the functionality that this module will do.
>
> Or this legal_remove is good?

The functionality is to remove the copyright headers, right? Or are we missing anything?

Bytes-Explorer · 2024-06-25T15:59:18Z

> You need a license_copyright_remove/README.md
> Can we find a shorter name? Maybe legal_removal, common_removal?
>
> @ykalathiya Pls suggest a name that calls out the functionality that this module will do.
>
> Or this legal_remove is good?
>
> The functionality is to remove the copyright headers, right or are we missing anything?

legal_removal is too broad and could be misleading. Same with code_clean. Please suggest a name that makes clear what this module does.

ykalathiya · 2024-06-25T16:02:18Z

> You need a license_copyright_remove/README.md
> Can we find a shorter name? Maybe legal_removal, common_removal?
>
> @ykalathiya Pls suggest a name that calls out the functionality that this module will do.
>
> Or this legal_remove is good?
>
> The functionality is to remove the copyright headers, right or are we missing anything?

Functionality is to remove license and copyright.

ykalathiya · 2024-06-25T16:47:44Z

> You need a license_copyright_remove/README.md
> Can we find a shorter name? Maybe legal_removal, common_removal?
>
> @ykalathiya Pls suggest a name that calls out the functionality that this module will do.
>
> Or this legal_remove is good?
>
> The functionality is to remove the copyright headers, right or are we missing anything?
>
> legal_removal is too broad and could be misleading. Same with code_clean. Please suggest something so that one knows what is the functionality of this module.

Then we have to use license or copyright in the name, which makes the name longer, or we can name it after only one function, like license_cleaner.

Param-S · 2024-06-26T17:05:42Z

Can we name this module "header_cleanser", since it specifically looks at the copyright info at the beginning of the file?

ykalathiya · 2024-06-26T18:03:40Z

> Can we name this module as "header_cleanser" as it specifically looks at the copyright info at the beginning of the file.

OK, that's a good name. I'll change the module name.

daw3rd · 2024-06-26T23:16:04Z

Also, can you please run `make conventions` in each of the ray and python directories to be sure there are no MUSTs?

Bytes-Explorer · 2024-06-27T04:04:32Z

> Can we name this module as "header_cleanser" as it specifically looks at the copyright info at the beginning of the file.

I am fine with that

Bytes-Explorer · 2024-06-27T04:05:00Z

@ykalathiya How many languages will this work for? Have we done any testing?

ykalathiya · 2024-06-27T07:44:49Z

> @ykalathiya How many languages will this work for? Have we done any testing?

I tested with this repo: https://github.com/arjuncvinod/Hello-World-in-Different-Languages
I added a license manually to the files and it was detected in all of them.
This is the list of languages (by file extension):
[
'js', 'intercal', 'ejs', 'php', 'cl', 'vhd', 'fs', 'applescript', 'ahk', 'java', 's', 'xml',
'xml', 'sol', 'cbl', 'ts', 'a68', 'ml', 'swift', 'coffee', 'chpl', 'pas', 'jsp', 'asm', 'sc',
'mat', 'txt', 'be', 'go', 'cpp', 'dart', 'bhai', 'erl', 'ps1', 'mojo', 'nut', 'chef', 'BAS',
'pyx', 'css', 'tcl', 'vb', 'py', 'm', 'lua', 'for', 'jl', 'ps', 'f95', 'ts', 'rb', 'rkt',
'sql', 'factor', 'nix', 'e', 'sh', 'pas', 'c', 'factor', 'sh', 'abap', 'js', 'jaksel', 'zig',
'bas', 't', 'bf', 'ex', 'txt', 'asm', 'r', 'hack', 'bas', 'lsp', 'pl', 'kt', 'st', 'pike',
'hs', 'm', 'lol', 'ads', 'v', 'fish', 'sml', 'sh', 'fth', 'jl', 'java', 'sas', 'rs'
]

Bytes-Explorer · 2024-06-27T07:48:07Z

> https://github.com/arjuncvinod/Hello-World-in-Different-Languages

Fantastic! Thank you!

ykalathiya · 2024-06-27T09:07:23Z

> Also, can you please run make conventions in each of the ray and python directories to be sure there are not MUSTs.

`make conventions` ran successfully on both python and ray.

transforms/code/header_cleanser/README.md

Param-S · 2024-06-27T09:53:21Z

transforms/code/header_cleanser/python/README.md

+
+This module is designed to detect and remove license and copyright information from code files. It leverages the [ScanCode Toolkit](https://pypi.org/project/scancode-toolkit/) to accurately identify and process licenses and copyrights in various programming languages.
+
+After detecting license and copyright position code has been stored at same column. Now lines which doesn't contain license or copyright copied to same position.


This statement is not very clear:

> After detecting license and copyright position code has been stored at same column. Now lines which doesn't contain license or copyright copied to same position.

Do you mean: "After locating the position of the license or copyright in the input code sample, this module removes those lines and returns the updated code. Input samples that do not contain a license or copyright header are left unchanged."?
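
A minimal sketch of the behavior described in this suggestion, assuming the detector reports each license or copyright match as a 1-indexed, inclusive `(start_line, end_line)` pair. The function name `remove_header_lines` and the span format are illustrative assumptions, not the transform's actual code:

```python
from typing import Iterable, Tuple


def remove_header_lines(code: str, spans: Iterable[Tuple[int, int]]) -> str:
    """Drop every line covered by a detected span and return the rest.

    `spans` holds 1-indexed, inclusive (start_line, end_line) pairs
    (a hypothetical format chosen for this sketch).
    """
    drop = set()
    for start, end in spans:
        drop.update(range(start, end + 1))
    lines = code.splitlines()
    kept = [line for number, line in enumerate(lines, start=1) if number not in drop]
    return "\n".join(kept)


sample = "# Copyright 2024 Example Corp\n# Licensed under Apache-2.0\nprint('hello')"
# Two detected spans: the copyright line and the license line.
cleaned = remove_header_lines(sample, [(1, 1), (2, 2)])
# No spans detected: the sample comes back unchanged.
unchanged = remove_header_lines(sample, [])
```

With no detected spans the input is returned as-is, which matches the "no change for samples without a header" part of the suggested wording.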

transforms/code/legal_removal/python/README.md

transforms/code/header_cleanser/README.md

daw3rd · 2024-06-28T00:22:56Z

Could you please merge dev into this branch, given some recent issues with versioning? See #355.

ykalathiya · 2024-06-29T10:18:17Z

I have added the kfp_ray directory and also mentioned it in the README file.

Param-S

LGTM

daw3rd

I have some more review to do, but a small start.

transforms/code/header_cleanser/python/src/header_cleanser_transform.py

transforms/code/header_cleanser/kfp_ray/README.md

blublinsky · 2024-07-01T08:20:41Z

transforms/code/header_cleanser/kfp_ray/header_cleanser_wf.py

+    runtime_actor_options: str = "{'num_cpus': 0.8}",
+    runtime_pipeline_id: str = "runtime_pipeline_id",
+    runtime_code_location: str = "{'github': 'github', 'commit_hash': '12345', 'path': 'path'}",
+    # code quality parameters


code quality?

blublinsky · 2024-07-01T08:21:19Z

transforms/code/header_cleanser/kfp_ray/header_cleanser_wf.py

+    runtime_pipeline_id: str,
+    runtime_job_id: str,
+    runtime_code_location: str,
+) -> dict:


missing specific parameters

blublinsky · 2024-07-01T08:21:49Z

transforms/code/header_cleanser/kfp_ray/header_cleanser_wf.py

+        "runtime_pipeline_id": runtime_pipeline_id,
+        "runtime_job_id": runtime_job_id,
+        "runtime_code_location": runtime_code_location,
+    }


missing specific parameters
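
One possible reading of the two "missing specific parameters" comments, sketched under assumptions: the function that builds the execution parameters would also accept and return the transform's own options. The function name `compute_exec_params` and the `header_cleanser_`-prefixed parameter names are hypothetical; only the three `runtime_*` entries appear in the diff above.

```python
def compute_exec_params(
    runtime_pipeline_id: str,
    runtime_job_id: str,
    runtime_code_location: str,
    # Transform-specific parameters (hypothetical names for this sketch).
    header_cleanser_contents_column_name: str,
    header_cleanser_license: bool,
    header_cleanser_copyright: bool,
) -> dict:
    """Collect runtime and transform-specific parameters into one dict."""
    return {
        "runtime_pipeline_id": runtime_pipeline_id,
        "runtime_job_id": runtime_job_id,
        "runtime_code_location": runtime_code_location,
        "header_cleanser_contents_column_name": header_cleanser_contents_column_name,
        "header_cleanser_license": header_cleanser_license,
        "header_cleanser_copyright": header_cleanser_copyright,
    }


params = compute_exec_params(
    runtime_pipeline_id="pipeline_id",
    runtime_job_id="job_id",
    runtime_code_location="{'github': 'github', 'commit_hash': '12345', 'path': 'path'}",
    header_cleanser_contents_column_name="contents",
    header_cleanser_license=True,
    header_cleanser_copyright=True,
)
```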

blublinsky · 2024-07-01T08:23:25Z

transforms/code/header_cleanser/kfp_ray/header_cleanser_wf.py

+    contents_column_name: str = "contents",
+    license: str = "true",
+    copyright: str = "true",
+    # additional parameters


The parameters above should have a prefix. Also, you can define license and copyright as booleans.

blublinsky · 2024-07-01T08:26:17Z

transforms/code/header_cleanser/python/src/header_cleanser_transform.py

+DEFAULT_COLUMN = "contents"
+DEFAULT_LICENSE = "true"
+DEFAULT_COPYRIGHT = "true"
+


These should be boolean

The boolean type gives an error while running make-test. That's why I used the str format.
Error:
ERROR collecting test/test_license_copyright_removal.py ________________________
test_license_copyright_removal.py:70: in get_test_transform_fixtures
self.create_license_copyright_removal_test_fixture(
test_license_copyright_removal.py:52: in create_license_copyright_removal_test_fixture
config = get_transform_config(ftc, cli)
../../../../../data-processing-lib/python/src/data_processing/transform/transform_configuration.py:100: in get_transform_config
args = parser.parse_args(argv)
/usr/lib/python3.11/argparse.py:1886: in parse_args
args, argv = self.parse_known_args(args, namespace)
/usr/lib/python3.11/argparse.py:1919: in parse_known_args
namespace, args = self._parse_known_args(args, namespace)
/usr/lib/python3.11/argparse.py:1962: in _parse_known_args
option_tuple = self._parse_optional(arg_string)
/usr/lib/python3.11/argparse.py:2261: in _parse_optional
if not arg_string[0] in self.prefix_chars:
E TypeError: 'bool' object is not subscriptable

Yes, take a look at https://github.com/IBM/data-prep-kit/blob/dev/transforms/code/code2parquet/python/src/code2parquet_transform.py#L184
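
The traceback above is consistent with a Python `bool` being placed directly into the argument list, since `argparse` indexes each argument as a string. A minimal sketch of one way to keep boolean options, assuming hypothetical flag names (`--header_cleanser_license`, `--header_cleanser_copyright`) and a local `str2bool` helper; the linked code2parquet transform may do this differently:

```python
import argparse


def str2bool(value: str) -> bool:
    """Interpret common textual spellings of a boolean (helper for this sketch)."""
    normalized = str(value).strip().lower()
    if normalized in ("true", "t", "yes", "y", "1"):
        return True
    if normalized in ("false", "f", "no", "n", "0"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean value, got {value!r}")


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    # Flag names are assumptions for illustration only.
    parser.add_argument("--header_cleanser_license", type=str2bool, default=True)
    parser.add_argument("--header_cleanser_copyright", type=str2bool, default=True)
    return parser


def to_argv(params: dict) -> list:
    """Turn a parameter dict into argv, converting every value to str first."""
    argv = []
    for key, value in params.items():
        argv.extend([f"--{key}", str(value)])
    return argv


args = build_parser().parse_args(
    to_argv({"header_cleanser_license": False, "header_cleanser_copyright": True})
)
```

Converting each value with `str()` before parsing avoids handing `argparse` a non-string, while `type=str2bool` restores real booleans on the parsed namespace.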

blublinsky · 2024-07-01T08:30:55Z

transforms/code/header_cleanser/python/src/header_cleanser_transform.py

+
+            else : 
+                return [table],{'Removed code count' : remove_code_count}
+


The logic above does not seem correct. Can you please double-check?

I think this works fine.
I provided the values of copyright and license as empty strings, and this is the output:
"python header_cleanser_local.py
input table has 10 rows

output table has 10 rows
Output metadata : {'Removed code count': 0}"
You can check it by running header_cleanser_local.py.

It does not. Your initial test:

self.license_remove=="true" and self.copyright_remove=="true":

and your second and third tests are identical. I think the second one should be:

elif self.copyright_remove=="true":

and the third one:

elif self.license_remove=='true':
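
The branch order suggested above can be sketched as a small stand-alone function. This assumes the two options are already booleans (as requested elsewhere in this review); the function name `removal_mode` and its string return values are illustrative, not part of the transform:

```python
def removal_mode(license_remove: bool, copyright_remove: bool) -> str:
    """Return which kind of header text should be removed."""
    if license_remove and copyright_remove:
        return "both"
    elif copyright_remove:
        return "copyright"
    elif license_remove:
        return "license"
    else:
        # Neither option is set: the table is returned untouched.
        return "none"


modes = {
    (lic, cop): removal_mode(lic, cop)
    for lic in (True, False)
    for cop in (True, False)
}
```

Each of the four input combinations reaches a different branch, which is the property the review comment says the original second and third tests lacked.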

transforms/code/header_cleanser/python/Dockerfile

transforms/code/header_cleanser/README.md

ykalathiya changed the title ~~License header~~ New transformer for license and copyright header removal Jun 25, 2024

Param-S requested review from Param-S, sapthasurendran and shivdeep-singh-ibm June 25, 2024 06:20

daw3rd requested changes Jun 25, 2024

ykalathiya force-pushed the licenseHeader branch 2 times, most recently from edab4ba to ad9b01d Compare June 25, 2024 15:10

daw3rd requested changes Jun 25, 2024

ykalathiya force-pushed the licenseHeader branch from ad9b01d to 111470c Compare June 26, 2024 08:52

ykalathiya requested a review from daw3rd June 26, 2024 08:53

ykalathiya force-pushed the licenseHeader branch from 111470c to 7e07f63 Compare June 26, 2024 15:02

Param-S requested changes Jun 27, 2024

Param-S reviewed Jun 27, 2024

ykalathiya force-pushed the licenseHeader branch from 4265b1e to a026f2b Compare June 27, 2024 15:43

daw3rd requested changes Jun 27, 2024

ykalathiya force-pushed the licenseHeader branch from a026f2b to b50f5ca Compare June 28, 2024 15:06

ykalathiya requested review from daw3rd and Param-S June 29, 2024 07:09

ykalathiya force-pushed the licenseHeader branch from b50f5ca to 046ee97 Compare June 29, 2024 10:17

Param-S approved these changes Jun 29, 2024

daw3rd requested changes Jul 1, 2024

ykalathiya force-pushed the licenseHeader branch from 046ee97 to 0f9651a Compare July 1, 2024 05:49

ykalathiya requested a review from daw3rd July 1, 2024 05:49

blublinsky reviewed Jul 1, 2024

ykalathiya force-pushed the licenseHeader branch from 0f9651a to 11891a8 Compare July 1, 2024 15:19

ykalathiya requested a review from blublinsky July 2, 2024 03:27

daw3rd reviewed Jul 2, 2024

ykalathiya added 12 commits July 3, 2024 20:58

new module to remove license and copyright header

deab145

This module use scancode.api to detect license and copyright Signed-off-by: Yash Kalathiya <[email protected]>

Updated Makefile

33c8e50

Signed-off-by: Yash Kalathiya <[email protected]>

Updated makefile and transformer

b1602d6

Signed-off-by: Yash Kalathiya <[email protected]>

Updated project version

009a4b4

Signed-off-by: Yash Kalathiya <[email protected]>

Added make version command

14d1fdc

Signed-off-by: Yash Kalathiya <[email protected]>

Changed name of module

170e115

Signed-off-by: Yash Kalathiya <[email protected]>

Updated makefile and README

780c3c7

Signed-off-by: Yash Kalathiya <[email protected]>

Renamed module

12ba45e

Signed-off-by: Yash Kalathiya <[email protected]>

changed makefile and readme

2163774

Signed-off-by: Yash Kalathiya <[email protected]>

Added kfp_ray

bbaa426

Signed-off-by: Yash Kalathiya <[email protected]>

fixed some bug

505fff0

Signed-off-by: Yash Kalathiya <[email protected]>

Changed docker file

081c1cf

Signed-off-by: Yash Kalathiya <[email protected]>

ykalathiya force-pushed the licenseHeader branch from 11891a8 to 081c1cf Compare July 3, 2024 17:33

ykalathiya requested a review from daw3rd July 3, 2024 17:35
