Detect text-reuse with Python version of Passim and compare #7

Open
4 tasks
piconti opened this issue May 2, 2024 · 0 comments
piconti commented May 2, 2024

Once the full text-reuse clusters have been generated and everything works as intended with passim v1, it would be worth also running the detection with the new Python version (v2), because:

  • The Python version (v2) is more recent and actively maintained, whereas v1 is old and no longer maintained.
  • Staying in Python, rather than adding new dependencies on Java, Spark, Scala, etc., is simpler for the project's sustainability.
  • The Python version does not require the initial boilerplate-detection step, which could make the process much faster.

Hence, depending on the results, it might be relevant and useful to switch to the Python version in the medium to long term.

The action points are:

  • Recompute the text reuse with the updated version
  • Compute statistics & visualizations to compare the results with those of the old version, computed on the new data
  • Optionally change the approach for future text-reuse processing runs and/or look into how to match the performance of the Scala version with the Python version
  • Document the process for using the Python version
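
For the comparison step, a minimal sketch of what the statistics could look like, assuming the outputs of both versions have already been loaded into plain `{passage_id: cluster_id}` dictionaries. The loading itself is left out because it depends on the actual passim output format, and all function names here (`group_by_cluster`, `size_stats`, `best_jaccard`) are hypothetical helpers, not part of passim:

```python
from collections import defaultdict
from statistics import mean, median


def group_by_cluster(assignments):
    """Invert a {passage_id: cluster_id} mapping into {cluster_id: set of passage ids}."""
    clusters = defaultdict(set)
    for passage_id, cluster_id in assignments.items():
        clusters[cluster_id].add(passage_id)
    return dict(clusters)


def size_stats(clusters):
    """Basic cluster-size statistics for one run."""
    sizes = [len(members) for members in clusters.values()]
    return {
        "n_clusters": len(sizes),
        "n_passages": sum(sizes),
        "mean_size": mean(sizes) if sizes else 0,
        "median_size": median(sizes) if sizes else 0,
        "max_size": max(sizes, default=0),
    }


def best_jaccard(clusters_a, clusters_b):
    """For each cluster in run A, the highest Jaccard overlap with any cluster in run B."""
    # Index run B by passage so only clusters sharing a passage are compared.
    passage_to_b = {p: cid for cid, members in clusters_b.items() for p in members}
    scores = {}
    for cid_a, members_a in clusters_a.items():
        candidates = {passage_to_b[p] for p in members_a if p in passage_to_b}
        best = 0.0
        for cid_b in candidates:
            members_b = clusters_b[cid_b]
            best = max(best, len(members_a & members_b) / len(members_a | members_b))
        scores[cid_a] = best
    return scores


# Toy data standing in for the v1 and v2 results (assumed, not real output).
v1 = {"p1": "a", "p2": "a", "p3": "b"}
v2 = {"p1": "x", "p2": "x", "p3": "x", "p4": "y"}

clusters_v1 = group_by_cluster(v1)
clusters_v2 = group_by_cluster(v2)
print(size_stats(clusters_v1))
print(size_stats(clusters_v2))
print(best_jaccard(clusters_v1, clusters_v2))
```

The per-cluster Jaccard scores only show how well each v1 cluster is recovered by v2. A global agreement measure or a histogram of cluster sizes could be added on top, depending on what turns out to be most informative.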