-
Notifications
You must be signed in to change notification settings - Fork 22
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
New transformer for license and copyright header removal #332
base: dev
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You need a license_copyright_remove/README.md
Can we find a shorter name? Maybe legal_removal, common_removal?
transforms/code/license_copyright_removal/python/requirements.txt
Outdated
Show resolved
Hide resolved
edab4ba
to
ad9b01d
Compare
transforms/code/license_copyright_removal/python/src/license_copyright_removal_transform.py
Outdated
Show resolved
Hide resolved
Excellent! Thanks. |
@ykalathiya Pls suggest a name that calls out the functionality that this module will do. |
legal_clean or code_clean? |
Or this legal_remove is good? |
The functionality is to remove the copyright headers, right or are we missing anything? |
legal_removal is too broad and could be misleading. Same with code_clean. Please suggest something so that one knows what is the functionality of this module. |
Functionality is to remove license and copyright. |
Then we have to use license or copyright in name which makes the bigger name or we can go with only one functionality like license_cleaner. |
ad9b01d
to
111470c
Compare
111470c
to
7e07f63
Compare
Can we name this module as "header_cleanser" as it specifically looks at the copyright info at the beginning of the file. |
|
Also, can you please run |
I am fine with that |
@ykalathiya How many languages will this work for? Have we done any testing? |
I have used this repo https://github.com/arjuncvinod/Hello-World-in-Different-Languages . |
Fantastic! Thank you! |
make conventions run successfully on both python and ray. |
|
||
This module is designed to detect and remove license and copyright information from code files. It leverages the [ScanCode Toolkit](https://pypi.org/project/scancode-toolkit/) to accurately identify and process licenses and copyrights in various programming languages. | ||
|
||
After detecting license and copyright position code has been stored at same column. Now lines which doesn't contain license or copyright copied to same position. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this statement is not very clear:
After detecting license and copyright position code has been stored at same column. Now lines which doesn't contain license or copyright copied to same position.
You mean to say "After locating the position of license or copyright in the input code/sample, this module delete/remove those lines and returns the updated code. There is no change for the input samples which do not contain the license or copyright header.
4265b1e
to
a026f2b
Compare
Could you please merge dev into this branch given some recent issues with versioning. See #355 |
a026f2b
to
b50f5ca
Compare
b50f5ca
to
046ee97
Compare
I have added kfp_ray directory and also in README file. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have some more review to do, but a small start.
transforms/code/header_cleanser/python/src/header_cleanser_transform.py
Outdated
Show resolved
Hide resolved
runtime_actor_options: str = "{'num_cpus': 0.8}", | ||
runtime_pipeline_id: str = "runtime_pipeline_id", | ||
runtime_code_location: str = "{'github': 'github', 'commit_hash': '12345', 'path': 'path'}", | ||
# code quality parameters |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
code quality?
runtime_pipeline_id: str, | ||
runtime_job_id: str, | ||
runtime_code_location: str, | ||
) -> dict: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
missing specific parameters
"runtime_pipeline_id": runtime_pipeline_id, | ||
"runtime_job_id": runtime_job_id, | ||
"runtime_code_location": runtime_code_location, | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
missing specific parameters
contents_column_name: str = "contents", | ||
license: str = "true", | ||
copyright: str = "true", | ||
# additional parameters |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the parameters above should have prefix. Also you can define license and copyright as boolean
DEFAULT_COLUMN = "contents" | ||
DEFAULT_LICENSE = "true" | ||
DEFAULT_COPYRIGHT = "true" | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These should be boolean
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
boolean type gives error while running make-test. That's why I used str format.
error :
ERROR collecting test/test_license_copyright_removal.py ________________________
test_license_copyright_removal.py:70: in get_test_transform_fixtures
self.create_license_copyright_removal_test_fixture(
test_license_copyright_removal.py:52: in create_license_copyright_removal_test_fixture
config = get_transform_config(ftc, cli)
../../../../../data-processing-lib/python/src/data_processing/transform/transform_configuration.py:100: in get_transform_config
args = parser.parse_args(argv)
/usr/lib/python3.11/argparse.py:1886: in parse_args
args, argv = self.parse_known_args(args, namespace)
/usr/lib/python3.11/argparse.py:1919: in parse_known_args
namespace, args = self._parse_known_args(args, namespace)
/usr/lib/python3.11/argparse.py:1962: in _parse_known_args
option_tuple = self._parse_optional(arg_string)
/usr/lib/python3.11/argparse.py:2261: in _parse_optional
if not arg_string[0] in self.prefix_chars:
E TypeError: 'bool' object is not subscriptable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
||
else : | ||
return [table],{'Removed code count' : remove_code_count} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The logic above does not seem correct. Can you, please, double check
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this works fine.
I have provided value of copyright and license as empty string and this is output of file.
"python header_cleanser_local.py
input table has 10 rows
output table has 10 rows
Output metadata : {'Removed code count': 0}"
You can check it by runnig header_cleanser_local.py.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does not. Your initial test:
self.license_remove=="true" and self.copyright_remove=="true":
and your second and third tests are identical. I think the second one should be:
elif self.copyright_remove=="true":
and the third one:
elif self.license_remove=='true':
This module use scancode.api to detect license and copyright Signed-off-by: Yash Kalathiya <[email protected]>
Signed-off-by: Yash Kalathiya <[email protected]>
Signed-off-by: Yash Kalathiya <[email protected]>
Signed-off-by: Yash Kalathiya <[email protected]>
Signed-off-by: Yash Kalathiya <[email protected]>
Signed-off-by: Yash Kalathiya <[email protected]>
Signed-off-by: Yash Kalathiya <[email protected]>
Signed-off-by: Yash Kalathiya <[email protected]>
Signed-off-by: Yash Kalathiya <[email protected]>
Signed-off-by: Yash Kalathiya <[email protected]>
Signed-off-by: Yash Kalathiya <[email protected]>
Signed-off-by: Yash Kalathiya <[email protected]>
Why are these changes needed?
It's a new transform module which removes license and copyright header from the input code data. This transforms module depends on (scancode-toolkit)[https://pypi.org/project/scancode-toolkit].
Related issue number (if any).
Closes: #63