Chinese and English hybrid log template mining #79

Open
0ptimista opened this issue Apr 26, 2023 · 5 comments

@0ptimista

Is there a way to mine log templates from logs that mix Chinese and English? For example, this log:

[2023-04-21 10:44:52,281][work-request-pool-38][myservice][engine.db.service.OflLTransListService:57][save][INFO ] =>保存:{“Name":"张三","ltermNo":"NH36BILD","lvouchNo":"102755","orderId":"565165056025","oriAmount":50000,"positionInfo":"longitude=101.1.10937&latitude=23426.8491705&address=云南省丽江市&trans_ip=139.14.18.96","printMName":"小型","status":"0","Time":1682045092000},耗时: 300ms

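The template Drain3 currently produces correctly identifies the IP address and most variables, but it treats the Chinese text as part of the template. For example, "耗时: 300ms" ("time taken: 300ms") ends up in the template as-is, rather than as "耗时: NUMms".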

@Superskyyy
Collaborator

Superskyyy commented Apr 26, 2023

@0ptimista Hello, this is a problem in the masking (preprocessing) phase. Consider providing a specific or extended regex, for example one with an optional (ms)? suffix, for this case. The default regex does not mask the 300ms part because it is not a standalone NUM but a NUMms.
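
As a rough illustration of the kind of extended regex meant here, the sketch below masks a number only when it is directly followed by the unit ms. It uses only Python's re module; the pattern, the `<NUM>` placeholder, and the function name are assumptions made for this example, not Drain3 defaults. A pattern like this could then be supplied through Drain3's masking configuration.

```python
import re

# Hypothetical pattern: a number (optionally with a decimal part) that is
# immediately followed by the unit "ms" and is not glued to a preceding
# letter, digit, or underscore.
DURATION_MS = re.compile(r"(?<![A-Za-z0-9_])\d+(?:\.\d+)?(?=ms\b)")


def mask_duration(line: str) -> str:
    """Replace the numeric part of NUMms with a placeholder, keeping the unit."""
    return DURATION_MS.sub("<NUM>", line)


print(mask_duration("耗时: 300ms"))  # 耗时: <NUM>ms
print(mask_duration("NH36BILD"))     # NH36BILD (unchanged, no ms unit)
```

The lookahead keeps the ms suffix in the text, so the template reads `<NUM>ms` instead of collapsing the unit into the mask.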

@Superskyyy Superskyyy changed the title from "中英混合的日志模板解析" to "Chinese and English hybrid log template mining" Apr 26, 2023
@Superskyyy
Collaborator

Superskyyy commented Apr 26, 2023

Indeed, the current algorithm implementation does not treat a token containing digits as a variable. As a result, two otherwise similar logs with different NUMms values get a lower similarity score, even though such tokens should not contribute significantly to the similarity calculation. This is mentioned in the original paper (DAGDrain), though, so it will be implemented in upcoming releases as part of the dynamic similarity threshold feature (replacing the default of 0.4).
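
To make the effect on similarity concrete, here is a minimal sketch of a position-wise token similarity with an optional switch that leaves digit-bearing tokens out of the calculation. It is one possible reading of the idea described above, not Drain3's actual implementation; the function names and the `ignore_digit_tokens` flag are hypothetical.

```python
def has_digit(token: str) -> bool:
    """True if the token contains at least one digit character."""
    return any(ch.isdigit() for ch in token)


def similarity(seq1, seq2, ignore_digit_tokens=False):
    """Fraction of positions where two equal-length token lists agree.

    With ignore_digit_tokens=True, positions where either token contains
    a digit are treated as variables and excluded from the calculation.
    """
    if len(seq1) != len(seq2):
        return 0.0
    matched = 0
    counted = 0
    for t1, t2 in zip(seq1, seq2):
        if ignore_digit_tokens and (has_digit(t1) or has_digit(t2)):
            continue
        counted += 1
        if t1 == t2:
            matched += 1
    # If every position was excluded (or both lists are empty), treat as identical.
    return matched / counted if counted else 1.0


a = ["耗时:", "300ms"]
b = ["耗时:", "412ms"]
print(similarity(a, b))                            # 0.5
print(similarity(a, b, ignore_digit_tokens=True))  # 1.0
```

With a fixed threshold such as 0.4, short messages are the most sensitive to this difference, because a single unmasked numeric token accounts for a large share of the score.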

@Superskyyy Superskyyy added the enhancement New feature or request label Apr 26, 2023
@Superskyyy Superskyyy self-assigned this Apr 26, 2023
@0ptimista
Author

Thanks for replying!

I don't think I stated my question properly. The fix for 300ms is straightforward. What I'm asking is whether there is a way to keep Chinese phrases such as '保存' and '耗时' as part of the log template, since they are unlikely to change, while a phrase like '张三' in “Name":"张三" is clearly a variable and should be masked.
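
For the second half of that question, one possible approach is to mask the value by its key rather than by its content. The sketch below is a plain re example under that assumption; the pattern, the `<NAME>` placeholder, and the function name are hypothetical. It accepts either a straight or a curly opening quote before Name, because the sample log above contains “Name".

```python
import re

# Hypothetical pattern: the value of the "Name" key. The lookbehind requires
# a quote character right before Name, so keys that merely end in "Name"
# (such as printMName) are left alone.
NAME_VALUE = re.compile(r'(?<=[“"]Name":")[^"]*(?=")')


def mask_name(line: str) -> str:
    """Replace the value of the Name field with a placeholder."""
    return NAME_VALUE.sub("<NAME>", line)


sample = '{“Name":"张三","printMName":"小型"}'
print(mask_name(sample))  # {“Name":"<NAME>","printMName":"小型"}
```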

@Superskyyy
Collaborator

Superskyyy commented Apr 26, 2023

@0ptimista If I understand correctly, I believe this comes from Chinese word segmentation: unlike English, which can simply be split on blank spaces, Chinese text needs an extra layer of processing, using something like Jieba https://github.com/fxsjy/jieba, to turn it into a proper sequence of tokens instead of one continuous chunk of text.

But there's a catch: full segmentation, which is required if you don't know what will appear in the logs, could decrease performance significantly. Another way, when you do have a dictionary of the words that can appear in the logs, is to implement a new set of regexes that replace each known Chinese phrase with a masked token surrounded by blanks, so that it is clustered properly.
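
A minimal sketch of the dictionary-based route, assuming a small hand-maintained phrase list: each known phrase is padded with blanks so that whitespace tokenization isolates it. Note that this variant keeps the phrase text itself instead of replacing it with a mask, since the goal in this thread is to keep such phrases in the template. The phrase list and function names are hypothetical.

```python
import re

# Hypothetical dictionary of Chinese phrases known to appear in the logs.
KNOWN_PHRASES = ["保存", "耗时"]


def build_phrase_pattern(phrases):
    """Compile an alternation of the phrases, longest first, so that a
    longer phrase wins over a shorter one it contains."""
    ordered = sorted(phrases, key=len, reverse=True)
    return re.compile("|".join(re.escape(p) for p in ordered))


PHRASE_PATTERN = build_phrase_pattern(KNOWN_PHRASES)


def tokenize(line: str):
    """Pad every known phrase with blanks, then split on whitespace."""
    spaced = PHRASE_PATTERN.sub(lambda m: f" {m.group(0)} ", line)
    return spaced.split()


print(tokenize("=>保存:{},耗时: 300ms"))
# ['=>', '保存', ':{},', '耗时', ':', '300ms']
```

Because the phrases become standalone tokens, they can stay constant across log lines while neighbouring values are masked or parameterized independently.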

Would you like to try it and provide an implementation?

@0ptimista
Author

Sure. I'm using Drain3 to mine my logs; if I find a solution that solves my original problem, I'll post it here.

Thanks for helping!
