Detect text-reuse with Python version of Passim and compare #7

piconti · 2024-05-02T08:49:43Z

Once the full text-reuse clusters have been generated and everything works as intended with passim v1, it would be interesting to also run this detection with the new Python version (v2), because:

The Python version (v2) is more recent and actively maintained, while v1 is old and no longer maintained.
Staying with Python, instead of adding further dependencies on Java, Spark, Scala, etc., is simpler for the project's sustainability.
The Python version does not require the initial boilerplate-detection step, which could make the process much faster.

Hence, depending on the results, it might be relevant and useful to switch to the Python version in the medium to long term.

The action points are:

Recompute the text reuse with the updated version (v2)
Compute statistics and visualizations to compare the results with those of the old version (v1) computed on the new data
Optionally change the approach for future text-reuse processing, and/or look into how to match the Scala version's performance with the Python version
Document the process for using the Python version
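
For the comparison step, a rough sketch of what the statistics could look like is given below. It assumes both runs are exported as JSON Lines with one passage per line, and that each record carries a cluster identifier and a document identifier; the field names `cluster` and `id`, and all function names, are assumptions for illustration and should be adjusted to the actual output of each version. Since cluster identifiers are not expected to match between runs, the sketch compares the sets of document pairs that end up in the same cluster rather than the cluster ids themselves.

```python
import json
from collections import defaultdict
from itertools import combinations
from statistics import mean


def load_records(path):
    """Read one JSON object per line (JSON Lines), skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def clusters_from_records(records, cluster_key="cluster", doc_key="id"):
    """Group document ids by cluster id (key names are assumptions)."""
    clusters = defaultdict(set)
    for record in records:
        clusters[record[cluster_key]].add(record[doc_key])
    return dict(clusters)


def co_clustered_pairs(clusters):
    """Return the set of unordered document pairs that share a cluster."""
    pairs = set()
    for members in clusters.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs


def compare_runs(records_v1, records_v2):
    """Summarise two runs and measure how much their pairings overlap."""
    c1 = clusters_from_records(records_v1)
    c2 = clusters_from_records(records_v2)
    p1, p2 = co_clustered_pairs(c1), co_clustered_pairs(c2)
    union = p1 | p2
    return {
        "clusters_v1": len(c1),
        "clusters_v2": len(c2),
        "mean_size_v1": mean(len(m) for m in c1.values()) if c1 else 0.0,
        "mean_size_v2": mean(len(m) for m in c2.values()) if c2 else 0.0,
        # Jaccard index over co-clustered pairs: 1.0 means identical pairings.
        "pair_jaccard": len(p1 & p2) / len(union) if union else 1.0,
    }
```

With real data, the two record lists would come from `load_records` applied to each run's output file (paths depend on the setup). Note that the number of pairs grows quadratically with cluster size, so very large clusters may need sampling or a different overlap measure.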

piconti mentioned this issue May 2, 2024

Prepare and launch text-reuse detection with Passim #5

Open

9 tasks

e-maud assigned piconti May 2, 2024
