New transformer for license and copyright header removal #332

Open
wants to merge 12 commits into
base: dev
from

Conversation

@ykalathiya commented Jun 25, 2024

Why are these changes needed?

This is a new transform module that removes license and copyright headers from the input code data. The transform depends on [scancode-toolkit](https://pypi.org/project/scancode-toolkit).

Related issue number (if any).

Closes: #63

@ykalathiya changed the title from "License header" to "New transformer for license and copyright header removal" on Jun 25, 2024
Member

@daw3rd left a comment

You need a license_copyright_remove/README.md

Can we find a shorter name? Maybe legal_removal, common_removal?

@ykalathiya force-pushed the licenseHeader branch 2 times, most recently from edab4ba to ad9b01d on June 25, 2024 15:10
transforms/code/legal_removal/ray/Makefile (outdated, resolved)
transforms/code/legal_removal/ray/README.md (outdated, resolved)
@daw3rd
Member

daw3rd commented Jun 25, 2024

> You need a license_copyright_remove/README.md
>
> Can we find a shorter name? Maybe legal_removal, common_removal?

Excellent! Thanks.

@Bytes-Explorer
Collaborator

> You need a license_copyright_remove/README.md
>
> Can we find a shorter name? Maybe legal_removal, common_removal?

@ykalathiya Pls suggest a name that calls out the functionality that this module will do.

@ykalathiya
Author

> You need a license_copyright_remove/README.md
> Can we find a shorter name? Maybe legal_removal, common_removal?
>
> @ykalathiya Pls suggest a name that calls out the functionality that this module will do.

legal_clean or code_clean?

@ykalathiya
Author

> You need a license_copyright_remove/README.md
> Can we find a shorter name? Maybe legal_removal, common_removal?
>
> @ykalathiya Pls suggest a name that calls out the functionality that this module will do.

Or is legal_remove good?

@Bytes-Explorer
Collaborator

> You need a license_copyright_remove/README.md
> Can we find a shorter name? Maybe legal_removal, common_removal?
>
> @ykalathiya Pls suggest a name that calls out the functionality that this module will do.
>
> Or is legal_remove good?

The functionality is to remove the copyright headers, right? Or are we missing anything?

@Bytes-Explorer
Collaborator

> You need a license_copyright_remove/README.md
> Can we find a shorter name? Maybe legal_removal, common_removal?
>
> @ykalathiya Pls suggest a name that calls out the functionality that this module will do.
>
> Or is legal_remove good?
>
> The functionality is to remove the copyright headers, right? Or are we missing anything?

legal_removal is too broad and could be misleading. Same with code_clean. Please suggest something so that one knows what the functionality of this module is.

@ykalathiya
Author

> You need a license_copyright_remove/README.md
> Can we find a shorter name? Maybe legal_removal, common_removal?
>
> @ykalathiya Pls suggest a name that calls out the functionality that this module will do.
>
> Or is legal_remove good?
>
> The functionality is to remove the copyright headers, right? Or are we missing anything?

The functionality is to remove both the license and the copyright.

@ykalathiya
Author

> You need a license_copyright_remove/README.md
> Can we find a shorter name? Maybe legal_removal, common_removal?
>
> @ykalathiya Pls suggest a name that calls out the functionality that this module will do.
>
> Or is legal_remove good?
>
> The functionality is to remove the copyright headers, right? Or are we missing anything?
>
> legal_removal is too broad and could be misleading. Same with code_clean. Please suggest something so that one knows what the functionality of this module is.

Then we have to use "license" or "copyright" in the name, which makes the name longer, or we can name it after only one functionality, like license_cleaner.

@Param-S
Collaborator

Param-S commented Jun 26, 2024

Can we name this module "header_cleanser", as it specifically looks at the copyright info at the beginning of the file?

@ykalathiya
Author

> Can we name this module "header_cleanser", as it specifically looks at the copyright info at the beginning of the file?

OK, that's a good name. I'll change the module name.

@daw3rd
Member

daw3rd commented Jun 26, 2024

Also, can you please run make conventions in each of the ray and python directories to be sure there are no MUSTs?

@Bytes-Explorer
Collaborator

> Can we name this module "header_cleanser", as it specifically looks at the copyright info at the beginning of the file?

I am fine with that.

@Bytes-Explorer
Collaborator

@ykalathiya How many languages will this work for? Have we done any testing?

@ykalathiya
Author

> @ykalathiya How many languages will this work for? Have we done any testing?

I used this repo: https://github.com/arjuncvinod/Hello-World-in-Different-Languages
I added a license manually to the files and it was detected in all of them.
This is the list of languages (by file extension):
[
'js', 'intercal', 'ejs', 'php', 'cl', 'vhd', 'fs', 'applescript', 'ahk', 'java', 's', 'xml',
'xml', 'sol', 'cbl', 'ts', 'a68', 'ml', 'swift', 'coffee', 'chpl', 'pas', 'jsp', 'asm', 'sc',
'mat', 'txt', 'be', 'go', 'cpp', 'dart', 'bhai', 'erl', 'ps1', 'mojo', 'nut', 'chef', 'BAS',
'pyx', 'css', 'tcl', 'vb', 'py', 'm', 'lua', 'for', 'jl', 'ps', 'f95', 'ts', 'rb', 'rkt',
'sql', 'factor', 'nix', 'e', 'sh', 'pas', 'c', 'factor', 'sh', 'abap', 'js', 'jaksel', 'zig',
'bas', 't', 'bf', 'ex', 'txt', 'asm', 'r', 'hack', 'bas', 'lsp', 'pl', 'kt', 'st', 'pike',
'hs', 'm', 'lol', 'ads', 'v', 'fish', 'sml', 'sh', 'fth', 'jl', 'java', 'sas', 'rs'
]

@Bytes-Explorer
Collaborator

> https://github.com/arjuncvinod/Hello-World-in-Different-Languages

Fantastic! Thank you!

@ykalathiya
Author

> Also, can you please run make conventions in each of the ray and python directories to be sure there are no MUSTs?

make conventions ran successfully in both the python and ray directories.


This module is designed to detect and remove license and copyright information from code files. It leverages the [ScanCode Toolkit](https://pypi.org/project/scancode-toolkit/) to accurately identify and process licenses and copyrights in various programming languages.

After detecting license and copyright position code has been stored at same column. Now lines which doesn't contain license or copyright copied to same position.
Collaborator

This statement is not very clear:

> After detecting license and copyright position code has been stored at same column. Now lines which doesn't contain license or copyright copied to same position.

Do you mean to say: "After locating the position of the license or copyright in the input code/sample, this module removes those lines and returns the updated code. Input samples that do not contain a license or copyright header are left unchanged."?
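
The behavior described in the suggested wording (drop the detected lines, return the rest, leave other samples untouched) could be sketched roughly as follows. This is only an illustration and not the code in this PR: `remove_line_spans` is a hypothetical helper, and the `(start_line, end_line)` spans are assumed to come from a detector such as ScanCode rather than being computed here.

```python
from typing import Iterable, List, Tuple


def remove_line_spans(code: str, spans: Iterable[Tuple[int, int]]) -> Tuple[str, bool]:
    """Drop every 1-based, inclusive (start_line, end_line) span from `code`.

    Hypothetical helper: the spans are assumed to be reported by a
    license/copyright detector. Returns the cleaned text and a flag
    telling whether any line was actually removed.
    """
    lines: List[str] = code.split("\n")
    to_drop = set()
    for start, end in spans:
        to_drop.update(range(start, end + 1))
    kept = [line for number, line in enumerate(lines, start=1) if number not in to_drop]
    return "\n".join(kept), len(kept) != len(lines)


sample = "# Copyright 2024 Example Corp\n# Licensed under Apache-2.0\nprint('hello')\n"
# Assumed detector output: line 1 is a copyright notice, line 2 a license notice.
cleaned, changed = remove_line_spans(sample, [(1, 1), (2, 2)])
```

A sample with no detected spans comes back unchanged with the flag set to `False`, which matches the "no change for inputs without a header" part of the description.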

transforms/code/legal_removal/python/README.md (outdated, resolved)
transforms/code/header_cleanser/README.md (outdated, resolved)
@daw3rd
Member

daw3rd commented Jun 28, 2024

Could you please merge dev into this branch given some recent issues with versioning. See #355

@ykalathiya
Author

I have added the kfp_ray directory and also mentioned it in the README file.

Collaborator

@Param-S left a comment

LGTM

Member

@daw3rd left a comment

I have some more review to do, but a small start.

runtime_actor_options: str = "{'num_cpus': 0.8}",
runtime_pipeline_id: str = "runtime_pipeline_id",
runtime_code_location: str = "{'github': 'github', 'commit_hash': '12345', 'path': 'path'}",
# code quality parameters
Collaborator

code quality?

runtime_pipeline_id: str,
runtime_job_id: str,
runtime_code_location: str,
) -> dict:
Collaborator

missing specific parameters

"runtime_pipeline_id": runtime_pipeline_id,
"runtime_job_id": runtime_job_id,
"runtime_code_location": runtime_code_location,
}
Collaborator

missing specific parameters

contents_column_name: str = "contents",
license: str = "true",
copyright: str = "true",
# additional parameters
Collaborator

The parameters above should have a prefix. Also, you can define license and copyright as booleans.

DEFAULT_COLUMN = "contents"
DEFAULT_LICENSE = "true"
DEFAULT_COPYRIGHT = "true"

Collaborator

These should be boolean

Author

The boolean type gives an error while running make-test, which is why I used the str format.
Error:
ERROR collecting test/test_license_copyright_removal.py ________________________
test_license_copyright_removal.py:70: in get_test_transform_fixtures
self.create_license_copyright_removal_test_fixture(
test_license_copyright_removal.py:52: in create_license_copyright_removal_test_fixture
config = get_transform_config(ftc, cli)
../../../../../data-processing-lib/python/src/data_processing/transform/transform_configuration.py:100: in get_transform_config
args = parser.parse_args(argv)
/usr/lib/python3.11/argparse.py:1886: in parse_args
args, argv = self.parse_known_args(args, namespace)
/usr/lib/python3.11/argparse.py:1919: in parse_known_args
namespace, args = self._parse_known_args(args, namespace)
/usr/lib/python3.11/argparse.py:1962: in _parse_known_args
option_tuple = self._parse_optional(arg_string)
/usr/lib/python3.11/argparse.py:2261: in _parse_optional
if not arg_string[0] in self.prefix_chars:
E TypeError: 'bool' object is not subscriptable
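
The traceback shows argparse indexing into an argument string (`arg_string[0]`), which suggests that a raw `bool` ended up in the argv list passed to `parse_args`, where only strings are expected. One possible way to keep the options boolean is sketched below, assuming every value is converted to a string when the argv list is built and parsed back with a `type=` callable. The option names, `str2bool`, and `to_argv` are hypothetical and not taken from this PR or from the data-processing library.

```python
import argparse


def str2bool(value: str) -> bool:
    """Parse common textual spellings of a boolean (hypothetical helper)."""
    text = value.strip().lower()
    if text in ("true", "1", "yes"):
        return True
    if text in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    # Hypothetical, prefixed option names; the defaults are real booleans.
    parser.add_argument("--header_cleanser_license", type=str2bool, default=True)
    parser.add_argument("--header_cleanser_copyright", type=str2bool, default=True)
    return parser


def to_argv(params: dict) -> list:
    """Build an argv list made only of strings, as argparse expects."""
    argv = []
    for name, value in params.items():
        argv.extend([f"--{name}", str(value)])
    return argv


args = build_parser().parse_args(
    to_argv({"header_cleanser_license": False, "header_cleanser_copyright": True})
)
```

With this shape the transform code can test `args.header_cleanser_license` directly instead of comparing against the string `"true"`.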

Collaborator


else :
    return [table],{'Removed code count' : remove_code_count}

Collaborator

The logic above does not seem correct. Can you please double-check?

Author

I think this works fine.
I provided empty strings as the values of copyright and license, and this is the output:
"python header_cleanser_local.py
input table has 10 rows

output table has 10 rows
Output metadata : {'Removed code count': 0}"
You can check it by running header_cleanser_local.py.

Collaborator

It does not. Your initial test:

self.license_remove=="true" and self.copyright_remove=="true":

and your second and third tests are identical. I think the second one should be:

elif self.copyright_remove=="true":

and the third one:

elif self.license_remove=='true':
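
Put together, the branch order suggested above might look like the sketch below. It keeps the string flags used in the PR at this point; `select_removals` is a hypothetical stand-in for the real method, and it only reports which kinds of text would be removed rather than doing the removal.

```python
def select_removals(license_remove: str, copyright_remove: str) -> tuple:
    """Return the kinds of header text to remove (illustrative only).

    Each branch tests a different condition, so no two branches are
    identical and the final else is reached only when both flags are off.
    """
    if license_remove == "true" and copyright_remove == "true":
        return ("license", "copyright")
    elif copyright_remove == "true":
        return ("copyright",)
    elif license_remove == "true":
        return ("license",)
    else:
        return ()
```

Exercising all four flag combinations in a unit test is a cheap way to catch duplicated branches like the ones pointed out in this thread.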

This module uses scancode.api to detect license and copyright

Signed-off-by: Yash Kalathiya <[email protected]>
Development

Successfully merging this pull request may close these issues.

[Feature] Build a transform to remove headers from code files
5 participants