deanonymize(anonymize(text)) != text #1151

Open
zizhong opened this issue Aug 23, 2023 · 8 comments

Comments


zizhong commented Aug 23, 2023

Describe the bug
deanonymize(anonymize(text)) != text

To Reproduce
Steps to reproduce the behavior:

  1. Use the transformer model obi/deid_roberta_i2b2 as the analyzer.
  2. Analyze a text that contains "a medical license number MED-123456".
  3. anonymize() returns "a medical license number <ORGANIZATION><ID><US_DRIVER_LICENSE>", where each <item> stands for the base64-encoded encrypted value of that entity.
  4. deanonymize() returns "a medical license number MED-123123456".

Expected behavior
deanonymize(anonymize(text)) == text

Contributor

omri374 commented Aug 23, 2023

Hi @zizhong, thanks for reporting this. Would you mind adding the full analyzer and anonymizer results?

Author

zizhong commented Aug 23, 2023

@omri374 My pleasure!

Original text:

May 5, 2023
Name: Carl John Smith
DOB: 04/18/1985
SSN: 999-99-9999
Dear DDS Examiner:
Introduction:
Mr. Carl Smith is a 31-year-old man who has been experiencing homelessness on and off for all
his adult life. Mr. Smith says he is about 5’5" and weighs approximately 129 lbs. He presents as
very thin, typically wearing a clean white undershirt and loose-fitting khaki shorts at interviews.
His brown hair is disheveled and dirty looking, and he constantly fidgets and shakes his hand or
knee during interviews. Despite his best efforts, Carl is a poor historian. In interviews with this
writer, he needed constant redirecting and prompting to provide information about his
personal and psychiatric history. Carl is diagnosed with Major Depressive Disorder; recurrent,
Anxiety Disorder, Attention Deficit Hyperactivity Disorder, Intermittent Explosive Disorder, and
a possible traumatic brain injury. Physically, he has degenerative disc disease, Lumbar
radiculopathy, Allergic Rhinitis, and a history of fainting since childhood. When asked why
working is difficult for him, Carl responded "I have a hard time controlling myself. When I get
stressed out, I immediately shut down."

My name is Gavin and I plan to go to San Francisco later today. While there I want to buy 5 apples for 4 dollars each, and 10 bananas for 3 dollars each. How much will this cost me?

Hi, Gavin,

Zizhong Ye and Gordon Liu are schoolmates at Chadbroune Elementry School.

Here are a few example sentences we currently support:

Hello, my name is David Johnson and I live in Maine.
My credit card number is 4095-2609-9393-4932 and my crypto wallet id is 16Yeky6GMjeNkAiNcBY7ZhrLoMSgg1BoyZ.

On September 18 I visited microsoft.com and sent an email to [email protected],  from the IP 192.168.0.1.

My passport: 191280342 and my phone number: (212) 555-1234.

This is a valid International Bank Account Number: IL150120690000003111111 . Can you please check the status on bank account 954567876544?

Kate's social security number is 078-05-1126.  Her driver license? it is 1234567A.


John Smith called Sarah Jane at 321-456-7098 and told her to meet him at 1112 Market Street

During our recent meeting on February 23, 2023, at 10:30 AM, John Doe provided me with his personal details. His email is [email protected] and his contact number is 650-456-7890. He lives in New York City, USA, and belongs to the American nationality with Christian beliefs and a leaning towards the Democratic party. He mentioned that he recently made a transaction using his credit card 4111 1111 1111 1111 and transferred bitcoins to the wallet address 1A1zP1eP5QGefi2DMPTfTL5SLmv7DivfNa. While discussing his European travels, he noted down his IBAN as GB29 NWBK 6016 1331 9268 19. Additionally, he provided his website as https://johndoeportfolio.com. John also discussed some of his US-specific details. He said his bank account number is 1234567890123456 and his drivers license is Y12345678. His ITIN is 987-65-4321, and he recently renewed his passport, the number for which is 123456789. He emphasized not to share his SSN, which is 669-45-6789. Furthermore, he mentioned that he accesses his work files remotely through the IP 192.168.1.1 and has a medical license number MED-123456.

key: 16charEncryptKey16charEncryptKey

Analysis results:  [type: DATE_TIME, start: 1, end: 7, score: 1.0, type: PERSON, start: 19, end: 28, score: 1.0, type: PERSON, start: 29, end: 34, score: 1.0, type: PERSON, start: 105, end: 109, score: 1.0, type: PERSON, start: 110, end: 115, score: 1.0, type: AGE, start: 121, end: 123, score: 1.0, type: PERSON, start: 215, end: 220, score: 1.0, type: PERSON, start: 539, end: 543, score: 1.0, type: PERSON, start: 709, end: 713, score: 1.0, type: PERSON, start: 1077, end: 1081, score: 1.0, type: LOCATION, start: 1221, end: 1224, score: 1.0, type: LOCATION, start: 1225, end: 1234, score: 1.0, type: PERSON, start: 1371, end: 1377, score: 1.0, type: PERSON, start: 1387, end: 1389, score: 1.0, type: PERSON, start: 1394, end: 1400, score: 1.0, type: PERSON, start: 1401, end: 1404, score: 1.0, type: PERSON, start: 1528, end: 1533, score: 1.0, type: PERSON, start: 1534, end: 1541, score: 1.0, type: LOCATION, start: 1556, end: 1561, score: 1.0, type: CREDIT_CARD, start: 1588, end: 1607, score: 1.0, type: CRYPTO, start: 1635, end: 1669, score: 1.0, type: DATE_TIME, start: 1675, end: 1684, score: 1.0, type: DATE_TIME, start: 1685, end: 1687, score: 1.0, type: EMAIL_ADDRESS, start: 1733, end: 1751, score: 1.0, type: PHONE_NUMBER, start: 1824, end: 1829, score: 1.0, type: PHONE_NUMBER, start: 1830, end: 1838, score: 1.0, type: IBAN_CODE, start: 1892, end: 1915, score: 1.0, type: PERSON, start: 2066, end: 2070, score: 1.0, type: PERSON, start: 2071, end: 2076, score: 1.0, type: PERSON, start: 2084, end: 2089, score: 1.0, type: PERSON, start: 2090, end: 2094, score: 1.0, type: UK_NHS, start: 2098, end: 2110, score: 1.0, type: PHONE_NUMBER, start: 2098, end: 2101, score: 1.0, type: LOCATION, start: 2151, end: 2157, score: 1.0, type: DATE_TIME, start: 2188, end: 2200, score: 1.0, type: PERSON, start: 2220, end: 2224, score: 1.0, type: PERSON, start: 2225, end: 2228, score: 1.0, type: EMAIL_ADDRESS, start: 2281, end: 2300, score: 1.0, type: PHONE_NUMBER, start: 2327, end: 2330, 
score: 1.0, type: LOCATION, start: 2353, end: 2361, score: 1.0, type: LOCATION, start: 2362, end: 2366, score: 1.0, type: LOCATION, start: 2368, end: 2371, score: 1.0, type: CREDIT_CARD, start: 2551, end: 2570, score: 1.0, type: CRYPTO, start: 2618, end: 2652, score: 1.0, type: IBAN_CODE, start: 2719, end: 2746, score: 1.0, type: PERSON, start: 2819, end: 2823, score: 1.0, type: PHONE_NUMBER, start: 3200, end: 3203, score: 1.0, type: PERSON, start: 1195, end: 1200, score: 0.9900000095367432, type: PHONE_NUMBER, start: 1588, end: 1592, score: 0.9900000095367432, type: PHONE_NUMBER, start: 1766, end: 1769, score: 0.9900000095367432, type: LOCATION, start: 2139, end: 2150, score: 0.9900000095367432, type: DATE_TIME, start: 2201, end: 2205, score: 0.9900000095367432, type: ORGANIZATION, start: 2473, end: 2478, score: 0.9900000095367432, type: LOCATION, start: 2851, end: 2853, score: 0.9900000095367432, type: DATE_TIME, start: 8, end: 12, score: 0.9800000190734863, type: PHONE_NUMBER, start: 1774, end: 1775, score: 0.9800000190734863, type: PHONE_NUMBER, start: 2551, end: 2565, score: 0.9800000190734863, type: DATE_TIME, start: 40, end: 43, score: 0.9700000286102295, type: PHONE_NUMBER, start: 2101, end: 2105, score: 0.9700000286102295, type: ORGANIZATION, start: 1445, end: 1451, score: 0.9599999785423279, type: EMAIL, start: 2281, end: 2282, score: 0.9599999785423279, type: IP_ADDRESS, start: 1766, end: 1777, score: 0.95, type: URL, start: 2789, end: 2817, score: 0.95, type: IP_ADDRESS, start: 3200, end: 3211, score: 0.95, type: PHONE_NUMBER, start: 2974, end: 2977, score: 0.949999988079071, type: PHONE_NUMBER, start: 3108, end: 3116, score: 0.9399999976158142, type: PHONE_NUMBER, start: 2977, end: 2983, score: 0.9300000071525574, type: PHONE_NUMBER, start: 3105, end: 3108, score: 0.9100000262260437, type: ORGANIZATION, start: 2289, end: 2296, score: 0.8999999761581421, type: PHONE_NUMBER, start: 1770, end: 1773, score: 0.8899999856948853, type: PHONE_NUMBER, start: 
2330, end: 2339, score: 0.8799999952316284, type: PHONE_NUMBER, start: 1592, end: 1607, score: 0.8600000143051147, type: ORGANIZATION, start: 2462, end: 2472, score: 0.8600000143051147, type: ORGANIZATION, start: 1424, end: 1434, score: 0.8500000238418579, type: US_SSN, start: 2014, end: 2025, score: 0.85, type: US_ITIN, start: 2974, end: 2985, score: 0.85, type: US_SSN, start: 3105, end: 3116, score: 0.85, type: PHONE_NUMBER, start: 2106, end: 2110, score: 0.8199999928474426, type: ORGANIZATION, start: 3245, end: 3248, score: 0.8100000023841858, type: PHONE_NUMBER, start: 2566, end: 2569, score: 0.7900000214576721, type: PHONE_NUMBER, start: 2729, end: 2738, score: 0.7900000214576721, type: PHONE_NUMBER, start: 2015, end: 2025, score: 0.7699999809265137, type: PERSON, start: 1378, end: 1379, score: 0.7599999904632568, type: PERSON, start: 1377, end: 1378, score: 0.75, type: PHONE_NUMBER, start: 1824, end: 1838, score: 1.0, type: PHONE_NUMBER, start: 2014, end: 2025, score: 0.7699999809265137, type: PHONE_NUMBER, start: 2327, end: 2339, score: 1.0, type: PERSON, start: 2284, end: 2288, score: 0.7400000095367432, type: ORGANIZATION, start: 1435, end: 1444, score: 0.7200000286102295, type: LOCATION, start: 2392, end: 2400, score: 0.7200000286102295, type: DATE_TIME, start: 43, end: 50, score: 0.699999988079071, type: PERSON, start: 2282, end: 2284, score: 0.6899999976158142, type: ORGANIZATION, start: 1698, end: 1703, score: 0.6700000166893005, type: ORGANIZATION, start: 2800, end: 2801, score: 0.6700000166893005, type: US_DRIVER_LICENSE, start: 2054, end: 2062, score: 0.6499999999999999, type: US_DRIVER_LICENSE, start: 2951, end: 2960, score: 0.6499999999999999, type: PERSON, start: 1379, end: 1386, score: 0.6499999761581421, type: DATE_TIME, start: 40, end: 50, score: 0.9700000286102295, type: PHONE_NUMBER, start: 2569, end: 2570, score: 0.5799999833106995, type: OTHERPHI, start: 1703, end: 1707, score: 0.5099999904632568, type: PERSON, start: 1981, end: 1985, 
score: 0.5099999904632568, type: US_ITIN, start: 56, end: 67, score: 0.5, type: URL, start: 1698, end: 1711, score: 0.5, type: URL, start: 1738, end: 1749, score: 0.5, type: URL, start: 2289, end: 2300, score: 0.5, type: ID, start: 2744, end: 2746, score: 0.5, type: PHONE_NUMBER, start: 62, end: 63, score: 0.49000000953674316, type: ID, start: 1793, end: 1799, score: 0.49, type: ID, start: 1892, end: 1909, score: 0.49, type: ID, start: 2907, end: 2920, score: 0.49, type: ID, start: 56, end: 62, score: 0.48, type: ID, start: 2014, end: 2015, score: 0.48, type: ID, start: 1966, end: 1976, score: 0.46, type: ID, start: 3049, end: 3056, score: 0.46, type: ID, start: 2054, end: 2059, score: 0.45, type: ID, start: 1635, end: 1644, score: 0.44, type: ID, start: 2719, end: 2726, score: 0.44, type: ID, start: 2739, end: 2743, score: 0.43, type: US_PASSPORT, start: 1793, end: 1802, score: 0.4, type: US_BANK_NUMBER, start: 1966, end: 1978, score: 0.4, type: PHONE_NUMBER, start: 2098, end: 2110, score: 1.0, type: ID, start: 2727, end: 2728, score: 0.4, type: US_BANK_NUMBER, start: 2907, end: 2923, score: 0.4, type: ID, start: 2951, end: 2955, score: 0.4, type: US_PASSPORT, start: 3049, end: 3058, score: 0.4, type: US_DRIVER_LICENSE, start: 3249, end: 3255, score: 0.4, type: ID, start: 1650, end: 1655, score: 0.39, type: ID, start: 2955, end: 2960, score: 0.39, type: ID, start: 2650, end: 2652, score: 0.38, type: DATE_TIME, start: 2789, end: 2794, score: 0.36000001430511475, type: ID, start: 3056, end: 3058, score: 0.35, type: ID, start: 3248, end: 3252, score: 0.34, type: ID, start: 1913, end: 1915, score: 0.32, type: ID, start: 1976, end: 1978, score: 0.32, type: ID, start: 2626, end: 2628, score: 0.32, type: ID, start: 2983, end: 2985, score: 0.32, type: PERSON, start: 2798, end: 2800, score: 0.3100000023841858, type: ID, start: 2618, end: 2621, score: 0.31, type: ID, start: 2628, end: 2629, score: 0.31, type: ID, start: 2643, end: 2646, score: 0.31, type: PHONE_NUMBER, 
start: 1776, end: 1777, score: 0.30000001192092896, type: ORGANIZATION, start: 2797, end: 2798, score: 0.30000001192092896, type: ID, start: 2629, end: 2631, score: 0.3, type: ID, start: 2726, end: 2727, score: 0.3, type: ID, start: 2640, end: 2641, score: 0.28, type: ID, start: 2641, end: 2643, score: 0.28, type: ID, start: 1799, end: 1802, score: 0.27, type: ID, start: 2632, end: 2638, score: 0.26, type: OTHERPHI, start: 1708, end: 1711, score: 0.23000000417232513, type: ID, start: 2638, end: 2640, score: 0.21, type: ID, start: 63, end: 67, score: 0.2, type: ID, start: 2621, end: 2625, score: 0.19, type: ID, start: 2625, end: 2626, score: 0.16, type: ID, start: 2920, end: 2923, score: 0.16, type: US_PASSPORT, start: 2951, end: 2960, score: 0.1, type: US_BANK_NUMBER, start: 1793, end: 1802, score: 0.05, type: US_SSN, start: 1793, end: 1802, score: 0.05, type: US_BANK_NUMBER, start: 3049, end: 3058, score: 0.05, type: US_DRIVER_LICENSE, start: 1793, end: 1802, score: 0.01, type: US_DRIVER_LICENSE, start: 1966, end: 1978, score: 0.01, type: US_DRIVER_LICENSE, start: 2907, end: 2923, score: 0.01, type: US_DRIVER_LICENSE, start: 3049, end: 3058, score: 0.01]
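
A note on the list above: near the end of the text, some spans intersect without one fully containing the other. Below is a minimal sketch for spotting such pairs, assuming each result is reduced to an `(entity_type, start, end)` tuple. The tuple layout and the `partial_overlaps` helper are illustrative names, not part of Presidio.

```python
from itertools import combinations


def partial_overlaps(spans):
    """Return pairs of spans that intersect without one containing the other."""
    pairs = []
    ordered = sorted(spans, key=lambda s: (s[1], s[2]))
    for a, b in combinations(ordered, 2):
        _, a_start, a_end = a
        _, b_start, b_end = b
        intersects = a_start < b_end and b_start < a_end
        a_contains_b = a_start <= b_start and b_end <= a_end
        b_contains_a = b_start <= a_start and a_end <= b_end
        if intersects and not (a_contains_b or b_contains_a):
            pairs.append((a, b))
    return pairs


# Three spans copied from the analysis results above; together they cover
# the substring "MED-123456" at the end of the text.
spans = [
    ("ORGANIZATION", 3245, 3248),
    ("ID", 3248, 3252),
    ("US_DRIVER_LICENSE", 3249, 3255),
]
found = partial_overlaps(spans)
print(found)
```

Only the ID and US_DRIVER_LICENSE spans are reported: they share offsets 3249 to 3252, while ORGANIZATION merely touches ID at 3248.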

sanitized_results:
text:

/oSOg6iCSSvrWeZlXxu68BOeKmiTzcNzQsnJGhBuE14= BXBJ6eCU59a5nGvzGtXkVd5oOjJWZ3606NWi6vUgna8=
Name: N4/k/tVfrGcIMHuiEeB4tzn1OPnvfqItq2GsaYL6DzE= ph8f+GFGdzb0kJ7jtupBDHhQmRah/peKV/UgXXEJxxQ=
DOB: doSS1fEZlEXjD/4dpBBgX9AfDo1MBQ6a9LIQmuBM/Zs=
SSN: 2HhEMucehDL/N9PB25Give8hbskDdkX6PKRVbbmBy3c=
Dear DDS Examiner:
Introduction:
Mr. xqYsVNVNr18ennd01WUFwd7uN6H2VMU4ciOoEG0WctI= /VUZ38hgaW8oIOqXKO/V5rhRJapPgYksLqPWPYfsabI= is a D/VKKs+lEPKpi0u8sM4GrCgFl5iRa8DYA0X6gj2D5WA=-year-old man who has been experiencing homelessness on and off for all
his adult life. Mr. NliXY4ki34IfIbzZtjE3uNftlnT32WVvoyJNayCdekY= says he is about 5’5" and weighs approximately 129 lbs. He presents as
very thin, typically wearing a clean white undershirt and loose-fitting khaki shorts at interviews.
His brown hair is disheveled and dirty looking, and he constantly fidgets and shakes his hand or
knee during interviews. Despite his best efforts, VsABjcnQqmUm/j03n4MKg2DqpFCr4pqITtmMifENZeE= is a poor historian. In interviews with this
writer, he needed constant redirecting and prompting to provide information about his
personal and psychiatric history. lBruvd9kF+rvdor093uxwhDtSKL/UK55A3DI+oSywtE= is diagnosed with Major Depressive Disorder; recurrent,
Anxiety Disorder, Attention Deficit Hyperactivity Disorder, Intermittent Explosive Disorder, and
a possible traumatic brain injury. Physically, he has degenerative disc disease, Lumbar
radiculopathy, Allergic Rhinitis, and a history of fainting since childhood. When asked why
working is difficult for him, 9ck+Cm18StxyLGQyKNvC2jBXJmrMpWU4sB8ZrFU1kAM= responded "I have a hard time controlling myself. When I get
stressed out, I immediately shut down."

My name is BKa65ekjqE4WQItErmVhMA/2OOOJN22KHfjgvsCa8so= and I plan to go to s4EvJlsKlpLYKD0zpGUdfft9ShuIEhrPzDzH7jSYEts= jCm9dzARnqHI0iJKC5OMieNLge4kdoVGm8grvb3YlAI= later today. While there I want to buy 5 apples for 4 dollars each, and 10 bananas for 3 dollars each. How much will this cost me?

Hi, Sc9T66XdTiYZ67ZsDXtIt61RjH3Ix4bmDrQzlzHrMU0=WyuALmGVkddrBmBg1hT/y2A5j9xhPrNZ1Ej9CLwbIhg=vDR2T7oK/yvou0saRKzPv5lYKzglBLfi6X0eIFYBJJo=6O/l3Kvxs0LqR9MacXjsndvYIwJy0amzv0DXByXElw8= pjeruxF2mmDdAV0TBTRzKVln6mAyJmq0G/WmQCXY5X0= and oRIwSVmhzsIRUUQdiBq9EG1nNY3jBVaF/rzNY3CeohM= MkXL66jTGCSYWjiLw+SmxnwXt0KbnQqtPkEtDeHBaZ8= are schoolmates at IeA7j0GO0zQX8wQmJkzW76yvRON8t3RWOZDO9FAigYs= dcr2T5rcxmVFxJJb27qkCbuBOOOPLGj8okogyjFxvCE= S3gKzkLnRIFWv1KvmwjDjAs578Ss/P46Y5QNqLO+mJI=.

Here are a few example sentences we currently support:

Hello, my name is BaOX4t4zf6ifgm2ynleYWx2zqI4rAqZwRfVfd5mymg8= jL/Ow3JsVqej2de+JruDmXcVImHsw2h8KEXyAwntAaw= and I live in h5T1VzxIeESGFf0Vwb3TNh0+FvuQhurUu9OVzmgfp9M=.
My credit card number is NrC5Fm+X1XsvO4ni9B1efz3eGXBUpGIja5qUJs4eJKIzWUgvexzrLDkdn1c2h8Vq and my crypto wallet id is xkDFwVdeQ0TFnoW/5WVyKnLbYTesROeB/XBYRGyOQ4Mjz7l0qgpr3DNdUB3CNGsNnfSWC2AHJjSuGQ0V21X9zA==.

On tH4kBVVvfUtQX3YMoNzYyBtyBJIK9Sg+iyWs9kg5ogw= v/5HfMn1UGlK/AMUsbUJ70kwGKg4CA+WvT8MVX8p3rI= I visited 6jafevKfBhz1CVvui9Wvk8t3BFyF18TUsSlPaEaM/IU= and sent an email to hcVH55QTg18VacjpzPcpZ0aIPONprLSNhaYmeZ2IbEOP9/mg/vPTgt7/z5v821iV,  from the IP NuUW3IpNC1Sg6HeluMAuVGa6u1Dsvfg0BZRUKm/l0l0=.

My passport: 9I/qaxhEhair6rHqgtFxlXMWz928SATTrdfJPr0fsmg= and my phone number: 9XyKeczSYzOLypCejq4vx2wb3Oac94XTodujyIyTM5E=.

This is a valid International Bank Account Number: dChBF5PuA8kcMX+ad/Hb/E57lFjvSUgvt/LwegwJKNtUxShlWKmp7vXMSVD3Ny2N . Can you please check the status on bank account qSApTPPKEfvyf9ttwBHxvR7Cwus/fnxLOY5okVAhSWg=?

jNTpQsZUHYTaFJsk8OsqQtIVhGkyw3f3IxRwgTabyKE='s social security number is v5b0SKTg7lFN2CC3BU+IDlNIQ6OD1RndYHbD4PkdweI=.  Her driver license? it is z2Gd0dTlduKZgusZZrm+E2wCi6XddWWR96QwgJjr6Pc=.


+EYE4N6zxuhpgT9dAdnoOEo1ck6FKX3u0DjH+axfNvs= KheWSsytZm/hc1MLoumGJNBIpykcegMJy1OzRuo8t0g= called 1AY6x7lB+gEkEOhEDO38qlKA0ZBvjcJeDBEFoXbo5MA= zm6PpJ1hQwPXKL5+kJblxJxUsOxnoDvR5c2bhDIag9M= at xwpVKZjnLWV6hktpTRAiyinDAyRvOXfsW1Tg9mvV7HI= and told her to meet him at tGQgAzg7BsNc04azpaVfL6RBbe0mmcSL9/ThFmXXEi8= 1XuhIBu/IO9l08LiItzv+PweW5qQOfvZZO1iIc5EYpU=

During our recent meeting on hbi15cSCVRclpEAaJw3DLcNokTF3ay1VYCu7ybJOVhE= 8HlY+yBPE8vadGocrI38aGuJFw6FoOoj2QmlRi+3DtQ=, at 10:30 AM, 7tFViRtxe5BchoD4nEIVSpYuM5mU0lJQLzW6QXxyCq0= MWrvdw1m3gbR16/rp0JPHduUB5sOpng9uo2/6n1CuCA= provided me with his personal details. His email is oVi4cdXrSs26rjglrmsEOIILOsCYhAyIapd8By4ZLIuVf2BLazMvLNDVmSWfrjUU and his contact number is OIoGnYJJKjxN8RL6DW7vzc6oKn/X9z6c60iFX87uBaY=. He lives in F1ZO9fpP8Zkclxkriwy1+xDSWlrACdAM8SgvvR2lz8o= enmAbh33dzbXygPTLVWTeTTT0tEDZ6WsIhyzanx/iUs=, EiLAs79xWju6oeJoFKLs2eTbcxSzeVl615wK2sAs/nI=, and belongs to the FuXlfgMW3OMA02MUoP3n+kvSoMxRpq+RleU6+4iBNEA= nationality with Christian beliefs and a leaning towards the YOV6XhhFbx3ZqxVl/4vTUYd/tswrOCwmvxU4pdnSJm8= gUSPxSEM70gTyMgO/Y4BuhCVzmXP70wTHVrZohzIuY4=. He mentioned that he recently made a transaction using his credit card PfDnKDp8kXwQZBDHS9O6zh+JFSCT2lqIAK+H7A6m7q4WS3qZUy0ZfoVjUvQFzj6a and transferred bitcoins to the wallet address fvh22oXJ70PkniEc+lamum+NlRFA9N0sjb4+azxrOLRc2H/ZOCiA97/Uaazz4FOgETZNhe1CKpwQWG5QpgdHxA==. While discussing his European travels, he noted down his IBAN as tfMNvU/H0GNfpg+L4maWqNx4cNEBiUTj6OLneia1DFa/jERz88GZ9Fzx8GIGONLx. Additionally, he provided his website as adBZKm3695s64cOXz+ZTdV/idRt/ag9q323/Os9jRBYvMElZt024Ut8nTReueC2L. 35HBMp9dKm0gOCONK7+d88JBKWwTnNVFy+mJ9ImRKl8= also discussed some of his TpRc7LDRHcwMPRRXm2NNNUNUae1RL7p6vxFqQPrE8Ko=-specific details. He said his bank account number is xUlkzkuVhON7kJGLbXViSpoC9phr41g8tm93l/H1jHMfwp87lubfiz5Yzr7sKyLN and his drivers license is la252+9F+ZmzPPeJO8XHT+tQaiBl9ypeb+b7/qiA4fo=. His ITIN is /2uD/S1LvqhXJkjjieKNbIwcyB6gwRvpbNs1leIh6LI=, and he recently renewed his passport, the number for which is 9HdRadE311qIMDzFAE77QNrt1kZnniYb0NYRHpMoseI=. He emphasized not to share his SSN, which is pIsW+gc75pCKMJOrVK5c4+v0MrOfkjGeYeFQXaJlKto=. 
Furthermore, he mentioned that he accesses his work files remotely through the IP hyR4W4fanUhn3FOZgpWOwHr6EZibPlzU2jeAOesbhyI= and has a medical license number c0Ooguq0cTCKtNJYMe0y5tuU9GW7puSkbxugbu1pvKA=kyLs2EJZA9yV41kqJwUQZj4NcFgXE6SY533sIXlNBJ4=Jc6WwHcM3QUw9ZMPFRv6xae6OvQRoDytLls16zyOvwQ=.

items:

[
    {'start': 6172, 'end': 6216, 'entity_type': 'US_DRIVER_LICENSE', 'text': 'Jc6WwHcM3QUw9ZMPFRv6xae6OvQRoDytLls16zyOvwQ=', 'operator': 'encrypt'},
    {'start': 6128, 'end': 6172, 'entity_type': 'ID', 'text': 'kyLs2EJZA9yV41kqJwUQZj4NcFgXE6SY533sIXlNBJ4=', 'operator': 'encrypt'},
    {'start': 6084, 'end': 6128, 'entity_type': 'ORGANIZATION', 'text': 'c0Ooguq0cTCKtNJYMe0y5tuU9GW7puSkbxugbu1pvKA=', 'operator': 'encrypt'},
    {'start': 6006, 'end': 6050, 'entity_type': 'IP_ADDRESS', 'text': 'hyR4W4fanUhn3FOZgpWOwHr6EZibPlzU2jeAOesbhyI=', 'operator': 'encrypt'},
    {'start': 5878, 'end': 5922, 'entity_type': 'US_SSN', 'text': 'pIsW+gc75pCKMJOrVK5c4+v0MrOfkjGeYeFQXaJlKto=', 'operator': 'encrypt'},
    {'start': 5787, 'end': 5831, 'entity_type': 'US_PASSPORT', 'text': '9HdRadE311qIMDzFAE77QNrt1kZnniYb0NYRHpMoseI=', 'operator': 'encrypt'},
    {'start': 5679, 'end': 5723, 'entity_type': 'US_ITIN', 'text': '/2uD/S1LvqhXJkjjieKNbIwcyB6gwRvpbNs1leIh6LI=', 'operator': 'encrypt'},
    {'start': 5621, 'end': 5665, 'entity_type': 'US_DRIVER_LICENSE', 'text': 'la252+9F+ZmzPPeJO8XHT+tQaiBl9ypeb+b7/qiA4fo=', 'operator': 'encrypt'},
    {'start': 5529, 'end': 5593, 'entity_type': 'US_BANK_NUMBER', 'text': 'xUlkzkuVhON7kJGLbXViSpoC9phr41g8tm93l/H1jHMfwp87lubfiz5Yzr7sKyLN', 'operator': 'encrypt'},
    {'start': 5431, 'end': 5475, 'entity_type': 'LOCATION', 'text': 'TpRc7LDRHcwMPRRXm2NNNUNUae1RL7p6vxFqQPrE8Ko=', 'operator': 'encrypt'},
    {'start': 5359, 'end': 5403, 'entity_type': 'PERSON', 'text': '35HBMp9dKm0gOCONK7+d88JBKWwTnNVFy+mJ9ImRKl8=', 'operator': 'encrypt'},
    {'start': 5293, 'end': 5357, 'entity_type': 'URL', 'text': 'adBZKm3695s64cOXz+ZTdV/idRt/ag9q323/Os9jRBYvMElZt024Ut8nTReueC2L', 'operator': 'encrypt'},
    {'start': 5186, 'end': 5250, 'entity_type': 'IBAN_CODE', 'text': 'tfMNvU/H0GNfpg+L4maWqNx4cNEBiUTj6OLneia1DFa/jERz88GZ9Fzx8GIGONLx', 'operator': 'encrypt'},
    {'start': 5031, 'end': 5119, 'entity_type': 'CRYPTO', 'text': 'fvh22oXJ70PkniEc+lamum+NlRFA9N0sjb4+azxrOLRc2H/ZOCiA97/Uaazz4FOgETZNhe1CKpwQWG5QpgdHxA==', 'operator': 'encrypt'},
    {'start': 4919, 'end': 4983, 'entity_type': 'CREDIT_CARD', 'text': 'PfDnKDp8kXwQZBDHS9O6zh+JFSCT2lqIAK+H7A6m7q4WS3qZUy0ZfoVjUvQFzj6a', 'operator': 'encrypt'},
    {'start': 4802, 'end': 4846, 'entity_type': 'ORGANIZATION', 'text': 'gUSPxSEM70gTyMgO/Y4BuhCVzmXP70wTHVrZohzIuY4=', 'operator': 'encrypt'},
    {'start': 4757, 'end': 4801, 'entity_type': 'ORGANIZATION', 'text': 'YOV6XhhFbx3ZqxVl/4vTUYd/tswrOCwmvxU4pdnSJm8=', 'operator': 'encrypt'},
    {'start': 4651, 'end': 4695, 'entity_type': 'LOCATION', 'text': 'FuXlfgMW3OMA02MUoP3n+kvSoMxRpq+RleU6+4iBNEA=', 'operator': 'encrypt'},
    {'start': 4586, 'end': 4630, 'entity_type': 'LOCATION', 'text': 'EiLAs79xWju6oeJoFKLs2eTbcxSzeVl615wK2sAs/nI=', 'operator': 'encrypt'},
    {'start': 4540, 'end': 4584, 'entity_type': 'LOCATION', 'text': 'enmAbh33dzbXygPTLVWTeTTT0tEDZ6WsIhyzanx/iUs=', 'operator': 'encrypt'},
    {'start': 4495, 'end': 4539, 'entity_type': 'LOCATION', 'text': 'F1ZO9fpP8Zkclxkriwy1+xDSWlrACdAM8SgvvR2lz8o=', 'operator': 'encrypt'},
    {'start': 4437, 'end': 4481, 'entity_type': 'PHONE_NUMBER', 'text': 'OIoGnYJJKjxN8RL6DW7vzc6oKn/X9z6c60iFX87uBaY=', 'operator': 'encrypt'},
    {'start': 4346, 'end': 4410, 'entity_type': 'EMAIL_ADDRESS', 'text': 'oVi4cdXrSs26rjglrmsEOIILOsCYhAyIapd8By4ZLIuVf2BLazMvLNDVmSWfrjUU', 'operator': 'encrypt'},
    {'start': 4249, 'end': 4293, 'entity_type': 'PERSON', 'text': 'MWrvdw1m3gbR16/rp0JPHduUB5sOpng9uo2/6n1CuCA=', 'operator': 'encrypt'},
    {'start': 4204, 'end': 4248, 'entity_type': 'PERSON', 'text': '7tFViRtxe5BchoD4nEIVSpYuM5mU0lJQLzW6QXxyCq0=', 'operator': 'encrypt'},
    {'start': 4145, 'end': 4189, 'entity_type': 'DATE_TIME', 'text': '8HlY+yBPE8vadGocrI38aGuJFw6FoOoj2QmlRi+3DtQ=', 'operator': 'encrypt'},
    {'start': 4100, 'end': 4144, 'entity_type': 'DATE_TIME', 'text': 'hbi15cSCVRclpEAaJw3DLcNokTF3ay1VYCu7ybJOVhE=', 'operator': 'encrypt'},
    {'start': 4025, 'end': 4069, 'entity_type': 'LOCATION', 'text': '1XuhIBu/IO9l08LiItzv+PweW5qQOfvZZO1iIc5EYpU=', 'operator': 'encrypt'},
    {'start': 3980, 'end': 4024, 'entity_type': 'LOCATION', 'text': 'tGQgAzg7BsNc04azpaVfL6RBbe0mmcSL9/ThFmXXEi8=', 'operator': 'encrypt'},
    {'start': 3907, 'end': 3951, 'entity_type': 'PHONE_NUMBER', 'text': 'xwpVKZjnLWV6hktpTRAiyinDAyRvOXfsW1Tg9mvV7HI=', 'operator': 'encrypt'},
    {'start': 3859, 'end': 3903, 'entity_type': 'PERSON', 'text': 'zm6PpJ1hQwPXKL5+kJblxJxUsOxnoDvR5c2bhDIag9M=', 'operator': 'encrypt'},
    {'start': 3814, 'end': 3858, 'entity_type': 'PERSON', 'text': '1AY6x7lB+gEkEOhEDO38qlKA0ZBvjcJeDBEFoXbo5MA=', 'operator': 'encrypt'},
    {'start': 3762, 'end': 3806, 'entity_type': 'PERSON', 'text': 'KheWSsytZm/hc1MLoumGJNBIpykcegMJy1OzRuo8t0g=', 'operator': 'encrypt'},
    {'start': 3717, 'end': 3761, 'entity_type': 'PERSON', 'text': '+EYE4N6zxuhpgT9dAdnoOEo1ck6FKX3u0DjH+axfNvs=', 'operator': 'encrypt'},
    {'start': 3669, 'end': 3713, 'entity_type': 'US_DRIVER_LICENSE', 'text': 'z2Gd0dTlduKZgusZZrm+E2wCi6XddWWR96QwgJjr6Pc=', 'operator': 'encrypt'},
    {'start': 3596, 'end': 3640, 'entity_type': 'US_SSN', 'text': 'v5b0SKTg7lFN2CC3BU+IDlNIQ6OD1RndYHbD4PkdweI=', 'operator': 'encrypt'},
    {'start': 3523, 'end': 3567, 'entity_type': 'PERSON', 'text': 'jNTpQsZUHYTaFJsk8OsqQtIVhGkyw3f3IxRwgTabyKE=', 'operator': 'encrypt'},
    {'start': 3476, 'end': 3520, 'entity_type': 'US_BANK_NUMBER', 'text': 'qSApTPPKEfvyf9ttwBHxvR7Cwus/fnxLOY5okVAhSWg=', 'operator': 'encrypt'},
    {'start': 3361, 'end': 3425, 'entity_type': 'IBAN_CODE', 'text': 'dChBF5PuA8kcMX+ad/Hb/E57lFjvSUgvt/LwegwJKNtUxShlWKmp7vXMSVD3Ny2N', 'operator': 'encrypt'},
    {'start': 3263, 'end': 3307, 'entity_type': 'PHONE_NUMBER', 'text': '9XyKeczSYzOLypCejq4vx2wb3Oac94XTodujyIyTM5E=', 'operator': 'encrypt'},
    {'start': 3197, 'end': 3241, 'entity_type': 'US_PASSPORT', 'text': '9I/qaxhEhair6rHqgtFxlXMWz928SATTrdfJPr0fsmg=', 'operator': 'encrypt'},
    {'start': 3137, 'end': 3181, 'entity_type': 'IP_ADDRESS', 'text': 'NuUW3IpNC1Sg6HeluMAuVGa6u1Dsvfg0BZRUKm/l0l0=', 'operator': 'encrypt'},
    {'start': 3058, 'end': 3122, 'entity_type': 'EMAIL_ADDRESS', 'text': 'hcVH55QTg18VacjpzPcpZ0aIPONprLSNhaYmeZ2IbEOP9/mg/vPTgt7/z5v821iV', 'operator': 'encrypt'},
    {'start': 2992, 'end': 3036, 'entity_type': 'URL', 'text': '6jafevKfBhz1CVvui9Wvk8t3BFyF18TUsSlPaEaM/IU=', 'operator': 'encrypt'},
    {'start': 2937, 'end': 2981, 'entity_type': 'DATE_TIME', 'text': 'v/5HfMn1UGlK/AMUsbUJ70kwGKg4CA+WvT8MVX8p3rI=', 'operator': 'encrypt'},
    {'start': 2892, 'end': 2936, 'entity_type': 'DATE_TIME', 'text': 'tH4kBVVvfUtQX3YMoNzYyBtyBJIK9Sg+iyWs9kg5ogw=', 'operator': 'encrypt'},
    {'start': 2798, 'end': 2886, 'entity_type': 'CRYPTO', 'text': 'xkDFwVdeQ0TFnoW/5WVyKnLbYTesROeB/XBYRGyOQ4Mjz7l0qgpr3DNdUB3CNGsNnfSWC2AHJjSuGQ0V21X9zA==', 'operator': 'encrypt'},
    {'start': 2706, 'end': 2770, 'entity_type': 'CREDIT_CARD', 'text': 'NrC5Fm+X1XsvO4ni9B1efz3eGXBUpGIja5qUJs4eJKIzWUgvexzrLDkdn1c2h8Vq', 'operator': 'encrypt'},
    {'start': 2635, 'end': 2679, 'entity_type': 'LOCATION', 'text': 'h5T1VzxIeESGFf0Vwb3TNh0+FvuQhurUu9OVzmgfp9M=', 'operator': 'encrypt'},
    {'start': 2576, 'end': 2620, 'entity_type': 'PERSON', 'text': 'jL/Ow3JsVqej2de+JruDmXcVImHsw2h8KEXyAwntAaw=', 'operator': 'encrypt'},
    {'start': 2531, 'end': 2575, 'entity_type': 'PERSON', 'text': 'BaOX4t4zf6ifgm2ynleYWx2zqI4rAqZwRfVfd5mymg8=', 'operator': 'encrypt'},
    {'start': 2410, 'end': 2454, 'entity_type': 'ORGANIZATION', 'text': 'S3gKzkLnRIFWv1KvmwjDjAs578Ss/P46Y5QNqLO+mJI=', 'operator': 'encrypt'},
    {'start': 2365, 'end': 2409, 'entity_type': 'ORGANIZATION', 'text': 'dcr2T5rcxmVFxJJb27qkCbuBOOOPLGj8okogyjFxvCE=', 'operator': 'encrypt'},
    {'start': 2320, 'end': 2364, 'entity_type': 'ORGANIZATION', 'text': 'IeA7j0GO0zQX8wQmJkzW76yvRON8t3RWOZDO9FAigYs=', 'operator': 'encrypt'},
    {'start': 2256, 'end': 2300, 'entity_type': 'PERSON', 'text': 'MkXL66jTGCSYWjiLw+SmxnwXt0KbnQqtPkEtDeHBaZ8=', 'operator': 'encrypt'},
    {'start': 2211, 'end': 2255, 'entity_type': 'PERSON', 'text': 'oRIwSVmhzsIRUUQdiBq9EG1nNY3jBVaF/rzNY3CeohM=', 'operator': 'encrypt'},
    {'start': 2162, 'end': 2206, 'entity_type': 'PERSON', 'text': 'pjeruxF2mmDdAV0TBTRzKVln6mAyJmq0G/WmQCXY5X0=', 'operator': 'encrypt'},
    {'start': 2117, 'end': 2161, 'entity_type': 'PERSON', 'text': '6O/l3Kvxs0LqR9MacXjsndvYIwJy0amzv0DXByXElw8=', 'operator': 'encrypt'},
    {'start': 2073, 'end': 2117, 'entity_type': 'PERSON', 'text': 'vDR2T7oK/yvou0saRKzPv5lYKzglBLfi6X0eIFYBJJo=', 'operator': 'encrypt'},
    {'start': 2029, 'end': 2073, 'entity_type': 'PERSON', 'text': 'WyuALmGVkddrBmBg1hT/y2A5j9xhPrNZ1Ej9CLwbIhg=', 'operator': 'encrypt'},
    {'start': 1985, 'end': 2029, 'entity_type': 'PERSON', 'text': 'Sc9T66XdTiYZ67ZsDXtIt61RjH3Ix4bmDrQzlzHrMU0=', 'operator': 'encrypt'},
    {'start': 1804, 'end': 1848, 'entity_type': 'LOCATION', 'text': 'jCm9dzARnqHI0iJKC5OMieNLge4kdoVGm8grvb3YlAI=', 'operator': 'encrypt'},
    {'start': 1759, 'end': 1803, 'entity_type': 'LOCATION', 'text': 's4EvJlsKlpLYKD0zpGUdfft9ShuIEhrPzDzH7jSYEts=', 'operator': 'encrypt'},
    {'start': 1694, 'end': 1738, 'entity_type': 'PERSON', 'text': 'BKa65ekjqE4WQItErmVhMA/2OOOJN22KHfjgvsCa8so=', 'operator': 'encrypt'},
    {'start': 1536, 'end': 1580, 'entity_type': 'PERSON', 'text': '9ck+Cm18StxyLGQyKNvC2jBXJmrMpWU4sB8ZrFU1kAM=', 'operator': 'encrypt'},
    {'start': 1128, 'end': 1172, 'entity_type': 'PERSON', 'text': 'lBruvd9kF+rvdor093uxwhDtSKL/UK55A3DI+oSywtE=', 'operator': 'encrypt'},
    {'start': 918, 'end': 962, 'entity_type': 'PERSON', 'text': 'VsABjcnQqmUm/j03n4MKg2DqpFCr4pqITtmMifENZeE=', 'operator': 'encrypt'},
    {'start': 555, 'end': 599, 'entity_type': 'PERSON', 'text': 'NliXY4ki34IfIbzZtjE3uNftlnT32WVvoyJNayCdekY=', 'operator': 'encrypt'},
    {'start': 419, 'end': 463, 'entity_type': 'AGE', 'text': 'D/VKKs+lEPKpi0u8sM4GrCgFl5iRa8DYA0X6gj2D5WA=', 'operator': 'encrypt'},
    {'start': 369, 'end': 413, 'entity_type': 'PERSON', 'text': '/VUZ38hgaW8oIOqXKO/V5rhRJapPgYksLqPWPYfsabI=', 'operator': 'encrypt'},
    {'start': 324, 'end': 368, 'entity_type': 'PERSON', 'text': 'xqYsVNVNr18ennd01WUFwd7uN6H2VMU4ciOoEG0WctI=', 'operator': 'encrypt'},
    {'start': 242, 'end': 286, 'entity_type': 'US_ITIN', 'text': '2HhEMucehDL/N9PB25Give8hbskDdkX6PKRVbbmBy3c=', 'operator': 'encrypt'},
    {'start': 192, 'end': 236, 'entity_type': 'DATE_TIME', 'text': 'doSS1fEZlEXjD/4dpBBgX9AfDo1MBQ6a9LIQmuBM/Zs=', 'operator': 'encrypt'},
    {'start': 142, 'end': 186, 'entity_type': 'PERSON', 'text': 'ph8f+GFGdzb0kJ7jtupBDHhQmRah/peKV/UgXXEJxxQ=', 'operator': 'encrypt'},
    {'start': 97, 'end': 141, 'entity_type': 'PERSON', 'text': 'N4/k/tVfrGcIMHuiEeB4tzn1OPnvfqItq2GsaYL6DzE=', 'operator': 'encrypt'},
    {'start': 46, 'end': 90, 'entity_type': 'DATE_TIME', 'text': 'BXBJ6eCU59a5nGvzGtXkVd5oOjJWZ3606NWi6vUgna8=', 'operator': 'encrypt'},
    {'start': 1, 'end': 45, 'entity_type': 'DATE_TIME', 'text': '/oSOg6iCSSvrWeZlXxu68BOeKmiTzcNzQsnJGhBuE14=', 'operator': 'encrypt'}
]

desanitized_results:
text:

May 5, 2023
Name: Carl John Smith
DOB: 04/18/1985
SSN: 999-99-9999
Dear DDS Examiner:
Introduction:
Mr. Carl Smith is a 31-year-old man who has been experiencing homelessness on and off for all
his adult life. Mr. Smith says he is about 5’5" and weighs approximately 129 lbs. He presents as
very thin, typically wearing a clean white undershirt and loose-fitting khaki shorts at interviews.
His brown hair is disheveled and dirty looking, and he constantly fidgets and shakes his hand or
knee during interviews. Despite his best efforts, Carl is a poor historian. In interviews with this
writer, he needed constant redirecting and prompting to provide information about his
personal and psychiatric history. Carl is diagnosed with Major Depressive Disorder; recurrent,
Anxiety Disorder, Attention Deficit Hyperactivity Disorder, Intermittent Explosive Disorder, and
a possible traumatic brain injury. Physically, he has degenerative disc disease, Lumbar
radiculopathy, Allergic Rhinitis, and a history of fainting since childhood. When asked why
working is difficult for him, Carl responded "I have a hard time controlling myself. When I get
stressed out, I immediately shut down."

My name is Gavin and I plan to go to San Francisco later today. While there I want to buy 5 apples for 4 dollars each, and 10 bananas for 3 dollars each. How much will this cost me?

Hi, Gavin,

Zizhong Ye and Gordon Liu are schoolmates at Chadbroune Elementry School.

Here are a few example sentences we currently support:

Hello, my name is David Johnson and I live in Maine.
My credit card number is 4095-2609-9393-4932 and my crypto wallet id is 16Yeky6GMjeNkAiNcBY7ZhrLoMSgg1BoyZ.

On September 18 I visited microsoft.com and sent an email to [email protected],  from the IP 192.168.0.1.

My passport: 191280342 and my phone number: (212) 555-1234.

This is a valid International Bank Account Number: IL150120690000003111111 . Can you please check the status on bank account 954567876544?

Kate's social security number is 078-05-1126.  Her driver license? it is 1234567A.


John Smith called Sarah Jane at 321-456-7098 and told her to meet him at 1112 Market Street

During our recent meeting on February 23, 2023, at 10:30 AM, John Doe provided me with his personal details. His email is [email protected] and his contact number is 650-456-7890. He lives in New York City, USA, and belongs to the American nationality with Christian beliefs and a leaning towards the Democratic party. He mentioned that he recently made a transaction using his credit card 4111 1111 1111 1111 and transferred bitcoins to the wallet address 1A1zP1eP5QGefi2DMPTfTL5SLmv7DivfNa. While discussing his European travels, he noted down his IBAN as GB29 NWBK 6016 1331 9268 19. Additionally, he provided his website as https://johndoeportfolio.com. John also discussed some of his US-specific details. He said his bank account number is 1234567890123456 and his drivers license is Y12345678. His ITIN is 987-65-4321, and he recently renewed his passport, the number for which is 123456789. He emphasized not to share his SSN, which is 669-45-6789. Furthermore, he mentioned that he accesses his work files remotely through the IP 192.168.1.1 and has a medical license number MED-123123456.

items:

[
    {'start': 3252, 'end': 3258, 'entity_type': 'US_DRIVER_LICENSE', 'text': '123456', 'operator': 'decrypt'},
    {'start': 3248, 'end': 3252, 'entity_type': 'ID', 'text': '-123', 'operator': 'decrypt'},
    {'start': 3245, 'end': 3248, 'entity_type': 'ORGANIZATION', 'text': 'MED', 'operator': 'decrypt'},
    {'start': 3200, 'end': 3211, 'entity_type': 'IP_ADDRESS', 'text': '192.168.1.1', 'operator': 'decrypt'},
    {'start': 3105, 'end': 3116, 'entity_type': 'US_SSN', 'text': '669-45-6789', 'operator': 'decrypt'},
    {'start': 3049, 'end': 3058, 'entity_type': 'US_PASSPORT', 'text': '123456789', 'operator': 'decrypt'},
    {'start': 2974, 'end': 2985, 'entity_type': 'US_ITIN', 'text': '987-65-4321', 'operator': 'decrypt'},
    {'start': 2951, 'end': 2960, 'entity_type': 'US_DRIVER_LICENSE', 'text': 'Y12345678', 'operator': 'decrypt'},
    {'start': 2907, 'end': 2923, 'entity_type': 'US_BANK_NUMBER', 'text': '1234567890123456', 'operator': 'decrypt'},
    {'start': 2851, 'end': 2853, 'entity_type': 'LOCATION', 'text': 'US', 'operator': 'decrypt'},
    {'start': 2819, 'end': 2823, 'entity_type': 'PERSON', 'text': 'John', 'operator': 'decrypt'},
    {'start': 2789, 'end': 2817, 'entity_type': 'URL', 'text': 'https://johndoeportfolio.com', 'operator': 'decrypt'},
    {'start': 2719, 'end': 2746, 'entity_type': 'IBAN_CODE', 'text': 'GB29 NWBK 6016 1331 9268 19', 'operator': 'decrypt'},
    {'start': 2618, 'end': 2652, 'entity_type': 'CRYPTO', 'text': '1A1zP1eP5QGefi2DMPTfTL5SLmv7DivfNa', 'operator': 'decrypt'},
    {'start': 2551, 'end': 2570, 'entity_type': 'CREDIT_CARD', 'text': '4111 1111 1111 1111', 'operator': 'decrypt'},
    {'start': 2473, 'end': 2478, 'entity_type': 'ORGANIZATION', 'text': 'party', 'operator': 'decrypt'},
    {'start': 2462, 'end': 2472, 'entity_type': 'ORGANIZATION', 'text': 'Democratic', 'operator': 'decrypt'},
    {'start': 2392, 'end': 2400, 'entity_type': 'LOCATION', 'text': 'American', 'operator': 'decrypt'},
    {'start': 2368, 'end': 2371, 'entity_type': 'LOCATION', 'text': 'USA', 'operator': 'decrypt'},
    {'start': 2362, 'end': 2366, 'entity_type': 'LOCATION', 'text': 'City', 'operator': 'decrypt'},
    {'start': 2353, 'end': 2361, 'entity_type': 'LOCATION', 'text': 'New York', 'operator': 'decrypt'},
    {'start': 2327, 'end': 2339, 'entity_type': 'PHONE_NUMBER', 'text': '650-456-7890', 'operator': 'decrypt'},
    {'start': 2281, 'end': 2300, 'entity_type': 'EMAIL_ADDRESS', 'text': '[email protected]', 'operator': 'decrypt'},
    {'start': 2225, 'end': 2228, 'entity_type': 'PERSON', 'text': 'Doe', 'operator': 'decrypt'},
    {'start': 2220, 'end': 2224, 'entity_type': 'PERSON', 'text': 'John', 'operator': 'decrypt'},
    {'start': 2201, 'end': 2205, 'entity_type': 'DATE_TIME', 'text': '2023', 'operator': 'decrypt'},
    {'start': 2188, 'end': 2200, 'entity_type': 'DATE_TIME', 'text': 'February 23,', 'operator': 'decrypt'},
    {'start': 2151, 'end': 2157, 'entity_type': 'LOCATION', 'text': 'Street', 'operator': 'decrypt'},
    {'start': 2139, 'end': 2150, 'entity_type': 'LOCATION', 'text': '1112 Market', 'operator': 'decrypt'},
    {'start': 2098, 'end': 2110, 'entity_type': 'PHONE_NUMBER', 'text': '321-456-7098', 'operator': 'decrypt'},
    {'start': 2090, 'end': 2094, 'entity_type': 'PERSON', 'text': 'Jane', 'operator': 'decrypt'},
    {'start': 2084, 'end': 2089, 'entity_type': 'PERSON', 'text': 'Sarah', 'operator': 'decrypt'},
    {'start': 2071, 'end': 2076, 'entity_type': 'PERSON', 'text': 'Smith', 'operator': 'decrypt'},
    {'start': 2066, 'end': 2070, 'entity_type': 'PERSON', 'text': 'John', 'operator': 'decrypt'},
    {'start': 2054, 'end': 2062, 'entity_type': 'US_DRIVER_LICENSE', 'text': '1234567A', 'operator': 'decrypt'},
    {'start': 2014, 'end': 2025, 'entity_type': 'US_SSN', 'text': '078-05-1126', 'operator': 'decrypt'},
    {'start': 1981, 'end': 1985, 'entity_type': 'PERSON', 'text': 'Kate', 'operator': 'decrypt'},
    {'start': 1966, 'end': 1978, 'entity_type': 'US_BANK_NUMBER', 'text': '954567876544', 'operator': 'decrypt'},
    {'start': 1892, 'end': 1915, 'entity_type': 'IBAN_CODE', 'text': 'IL150120690000003111111', 'operator': 'decrypt'},
    {'start': 1824, 'end': 1838, 'entity_type': 'PHONE_NUMBER', 'text': '(212) 555-1234', 'operator': 'decrypt'},
    {'start': 1793, 'end': 1802, 'entity_type': 'US_PASSPORT', 'text': '191280342', 'operator': 'decrypt'},
    {'start': 1766, 'end': 1777, 'entity_type': 'IP_ADDRESS', 'text': '192.168.0.1', 'operator': 'decrypt'},
    {'start': 1733, 'end': 1751, 'entity_type': 'EMAIL_ADDRESS', 'text': '[email protected]', 'operator': 'decrypt'},
    {'start': 1698, 'end': 1711, 'entity_type': 'URL', 'text': 'microsoft.com', 'operator': 'decrypt'},
    {'start': 1685, 'end': 1687, 'entity_type': 'DATE_TIME', 'text': '18', 'operator': 'decrypt'},
    {'start': 1675, 'end': 1684, 'entity_type': 'DATE_TIME', 'text': 'September', 'operator': 'decrypt'},
    {'start': 1635, 'end': 1669, 'entity_type': 'CRYPTO', 'text': '16Yeky6GMjeNkAiNcBY7ZhrLoMSgg1BoyZ', 'operator': 'decrypt'},
    {'start': 1588, 'end': 1607, 'entity_type': 'CREDIT_CARD', 'text': '4095-2609-9393-4932', 'operator': 'decrypt'},
    {'start': 1556, 'end': 1561, 'entity_type': 'LOCATION', 'text': 'Maine', 'operator': 'decrypt'},
    {'start': 1534, 'end': 1541, 'entity_type': 'PERSON', 'text': 'Johnson', 'operator': 'decrypt'},
    {'start': 1528, 'end': 1533, 'entity_type': 'PERSON', 'text': 'David', 'operator': 'decrypt'},
    {'start': 1445, 'end': 1451, 'entity_type': 'ORGANIZATION', 'text': 'School', 'operator': 'decrypt'},
    {'start': 1435, 'end': 1444, 'entity_type': 'ORGANIZATION', 'text': 'Elementry', 'operator': 'decrypt'},
    {'start': 1424, 'end': 1434, 'entity_type': 'ORGANIZATION', 'text': 'Chadbroune', 'operator': 'decrypt'},
    {'start': 1401, 'end': 1404, 'entity_type': 'PERSON', 'text': 'Liu', 'operator': 'decrypt'},
    {'start': 1394, 'end': 1400, 'entity_type': 'PERSON', 'text': 'Gordon', 'operator': 'decrypt'},
    {'start': 1387, 'end': 1389, 'entity_type': 'PERSON', 'text': 'Ye', 'operator': 'decrypt'},
    {'start': 1379, 'end': 1386, 'entity_type': 'PERSON', 'text': 'Zizhong', 'operator': 'decrypt'},
    {'start': 1378, 'end': 1379, 'entity_type': 'PERSON', 'text': '\n', 'operator': 'decrypt'},
    {'start': 1377, 'end': 1378, 'entity_type': 'PERSON', 'text': '\n', 'operator': 'decrypt'},
    {'start': 1371, 'end': 1377, 'entity_type': 'PERSON', 'text': 'Gavin,', 'operator': 'decrypt'},
    {'start': 1225, 'end': 1234, 'entity_type': 'LOCATION', 'text': 'Francisco', 'operator': 'decrypt'},
    {'start': 1221, 'end': 1224, 'entity_type': 'LOCATION', 'text': 'San', 'operator': 'decrypt'},
    {'start': 1195, 'end': 1200, 'entity_type': 'PERSON', 'text': 'Gavin', 'operator': 'decrypt'},
    {'start': 1077, 'end': 1081, 'entity_type': 'PERSON', 'text': 'Carl', 'operator': 'decrypt'},
    {'start': 709, 'end': 713, 'entity_type': 'PERSON', 'text': 'Carl', 'operator': 'decrypt'},
    {'start': 539, 'end': 543, 'entity_type': 'PERSON', 'text': 'Carl', 'operator': 'decrypt'},
    {'start': 215, 'end': 220, 'entity_type': 'PERSON', 'text': 'Smith', 'operator': 'decrypt'},
    {'start': 121, 'end': 123, 'entity_type': 'AGE', 'text': '31', 'operator': 'decrypt'},
    {'start': 110, 'end': 115, 'entity_type': 'PERSON', 'text': 'Smith', 'operator': 'decrypt'},
    {'start': 105, 'end': 109, 'entity_type': 'PERSON', 'text': 'Carl', 'operator': 'decrypt'},
    {'start': 56, 'end': 67, 'entity_type': 'US_ITIN', 'text': '999-99-9999', 'operator': 'decrypt'},
    {'start': 40, 'end': 50, 'entity_type': 'DATE_TIME', 'text': '04/18/1985', 'operator': 'decrypt'},
    {'start': 29, 'end': 34, 'entity_type': 'PERSON', 'text': 'Smith', 'operator': 'decrypt'},
    {'start': 19, 'end': 28, 'entity_type': 'PERSON', 'text': 'Carl John', 'operator': 'decrypt'},
    {'start': 8, 'end': 12, 'entity_type': 'DATE_TIME', 'text': '2023', 'operator': 'decrypt'},
    {'start': 1, 'end': 7, 'entity_type': 'DATE_TIME', 'text': 'May 5,', 'operator': 'decrypt'}
]

Result:

Traceback (most recent call last):
  File "/home/zzz/workspace/example/pg.py", line 929, in <module>
    _test()
  File "/home/zzz/workspace/example/pg.py", line 914, in _test
    response_j = sanitize_text(j.encode())
  File "/home/zzz/workspace/example/pg.py", line 597, in sanitize_text
    assert desanitized_results.text == text
AssertionError

@octaviansima

To add to this, I'm running into the following error:

  File "/home/os/.venv/lib/python3.10/site-packages/presidio_analyzer/analyzer_engine.py", line 189, in analyze
    nlp_artifacts = self.nlp_engine.process_text(text, language)
  File "/home/os/.venv/lib/python3.10/site-packages/presidio_analyzer/nlp_engine/spacy_nlp_engine.py", line 44, in process_text
    doc = self.nlp[language](text)
  File "/home/os/.venv/lib/python3.10/site-packages/spacy/language.py", line 1047, in __call__
    error_handler(name, proc, [doc], e)
  File "/home/os/.venv/lib/python3.10/site-packages/spacy/util.py", line 1724, in raise_error
    raise e
  File "/home/os/.venv/lib/python3.10/site-packages/spacy/language.py", line 1042, in __call__
    doc = proc(doc, **component_cfg.get(name, {}))  # type: ignore[call-arg]
  File "/home/os/.venv/lib/python3.10/site-packages/presidio_analyzer/nlp_engine/transformers_nlp_engine.py", line 71, in __call__
    doc.ents = ents
  File "spacy/tokens/doc.pyx", line 796, in spacy.tokens.doc.Doc.ents.__set__
  File "spacy/tokens/doc.pyx", line 833, in spacy.tokens.doc.Doc.set_ents
ValueError: [E1010] Unable to set entity information for token 28 which is included in more than one span in entities, blocked, missing or outside.

with the following code sample:

import transformers

from huggingface_hub import snapshot_download

from presidio_analyzer import AnalyzerEngine
from presidio_analyzer.nlp_engine import NlpEngineProvider

from presidio_anonymizer import AnonymizerEngine, DeanonymizeEngine

transformers_model = "obi/deid_roberta_i2b2"

snapshot_download(repo_id=transformers_model)

# Instantiate to make sure it's downloaded during installation and not runtime
transformers.AutoTokenizer.from_pretrained(transformers_model)
transformers.AutoModelForTokenClassification.from_pretrained(transformers_model)

# Create configuration containing engine name and models
configuration = {
    "nlp_engine_name": "transformers",
    "models": [
        {
            "lang_code": "en",
            "model_name": {
                "spacy": "en_core_web_sm",
                "transformers": transformers_model,
            },
        }
    ],
}

# Create NLP engine based on configuration
provider = NlpEngineProvider(nlp_configuration=configuration)
nlp_engine = provider.create_engine()

# Pass the created NLP engine and supported_languages to the AnalyzerEngine
analyzer = AnalyzerEngine(nlp_engine=nlp_engine, supported_languages=["en"])

# Initialize the anonymizer and deanonymizer engines
# Possibly put these into a server to avoid reinitialization
anonymizer = AnonymizerEngine()
deanonymizer = DeanonymizeEngine()


text = """
During our recent meeting on February 23, 2023, at 10:30 AM, John Doe provided 
me with his personal details. His email is [email protected] and his contact 
number is 650-456-7890. He lives in New York City, USA, and belongs to the 
American nationality with Christian beliefs and a leaning towards the Democratic party. 
He mentioned that he recently made a transaction using his credit card 4111 1111 1111 1111 
and transferred bitcoins to the wallet address 1A1zP1eP5QGefi2DMPTfTL5SLmv7DivfNa. 
While discussing his European travels, he noted down his IBAN as GB29 NWBK 6016 1331 9268 19. 
Additionally, he provided his website as https://johndoeportfolio.com. John also discussed some 
of his US-specific details. He said his bank account number is 1234567890123456 and his drivers license 
is Y12345678. His ITIN is 987-65-4321, and he recently renewed his passport, the number for 
which is 123456789. He emphasized not to share his SSN, which is 669-45-6789. 
Furthermore, he mentioned that he accesses his work files remotely through the IP 192.168.1.1 
and has a medical license number MED-123456.
"""

analysis_results = analyzer.analyze(text=text, language="en")

I believe this is related.

@omri374
Contributor

omri374 commented Aug 24, 2023

Hi @octaviansima, we are aware of this issue. Until we fix it (WIP), we recommend using the TransformersRecognizer approach rather than the TransformersNlpEngine. This should help with your issue, but please let us know if it doesn't.

@omri374
Contributor

omri374 commented Aug 24, 2023

Hi @zizhong, I attempted to reproduce this but wasn't able to. Steps I've taken:

  1. Create a TransformersRecognizer and configuration using this sample
  2. Call:
from presidio_analyzer import AnalyzerEngine, RecognizerRegistry
from presidio_analyzer.nlp_engine import NlpEngineProvider
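import spacy

# NOTE: TransformersRecognizer and BERT_DEID_CONFIGURATION come from the sample
# referenced in step 1; import them from wherever you saved those files.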

model_path = "obi/deid_roberta_i2b2"
supported_entities = BERT_DEID_CONFIGURATION.get(
    "PRESIDIO_SUPPORTED_ENTITIES")
transformers_recognizer = TransformersRecognizer(model_path=model_path,
                                                 supported_entities=supported_entities)

# This would download a large (~500Mb) model on the first run
transformers_recognizer.load_transformer(**BERT_DEID_CONFIGURATION)

# Add transformers model to the registry
registry = RecognizerRegistry()
registry.add_recognizer(transformers_recognizer)
registry.remove_recognizer("SpacyRecognizer")

# Use small spacy model, for faster inference.
if not spacy.util.is_package("en_core_web_sm"):
    spacy.cli.download("en_core_web_sm")

nlp_configuration = {
    "nlp_engine_name": "spacy",
    "models": [{"lang_code": "en", "model_name": "en_core_web_sm"}],
}

nlp_engine = NlpEngineProvider(nlp_configuration=nlp_configuration).create_engine()

analyzer = AnalyzerEngine(registry=registry, nlp_engine=nlp_engine)
results = analyzer.analyze(text, language="en",
                           return_decision_process=True)

Where text = the text you provided
3. Encrypt:

from presidio_anonymizer import AnonymizerEngine, DeanonymizeEngine
from presidio_anonymizer.entities import RecognizerResult, OperatorResult, OperatorConfig
from presidio_anonymizer.operators import Decrypt

key="16charEncryptKey16charEncryptKey"

engine = AnonymizerEngine()

# Invoke the anonymize function with the text,
# analyzer results (potentially coming from presidio-analyzer)
# and an 'encrypt' operator to get an encrypted anonymization output:
anonymize_result = engine.anonymize(
    text=text,
    analyzer_results=results,
    operators={"DEFAULT": OperatorConfig("encrypt", {"key": key})},
)

# Fetch the anonymized text from the result.
anonymized_text = anonymize_result.text

# Fetch the anonymized entities from the result.
anonymized_entities = anonymize_result.items
  4. Decrypt:
# Initialize the engine:
engine = DeanonymizeEngine()

# Invoke the deanonymize function with the text, anonymizer results
# and a 'decrypt' operator to get the original text as output.
deanonymized_result = engine.deanonymize(
    text=anonymized_text,
    entities=anonymized_entities,
    operators={"DEFAULT": OperatorConfig("decrypt", {"key": key})},
)

deanonymized_result.text
  5. Compare: assert text == deanonymized_result.text

We had a few contributions to the presidio-anonymizer package which aren't released to PyPI yet. It could be that one of them (like #1092 or #1078) is the source of the difference.

@zizhong
Author

zizhong commented Aug 24, 2023

Thanks!
The issue is with the code from 🤗 presidio-demo.
It was caused by chunking overlap. I added a check that filters out overlapping predictions, and the issue is now resolved.
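
For readers hitting the same problem, here is a minimal sketch of one way such an overlap filter could look. It is not the exact check used here: the function name `drop_overlaps`, the dict keys (`start`, `end`, `score`, `entity_group`) and the sample scores are assumptions for illustration, loosely modeled on aggregated token-classification output.

```python
def drop_overlaps(predictions):
    """Greedily keep the highest-scoring predictions whose spans do not overlap.

    Each prediction is assumed to be a dict with 'start', 'end' and 'score' keys
    (hypothetical shape, chosen for this sketch).
    """
    kept = []
    for pred in sorted(predictions, key=lambda p: p["score"], reverse=True):
        # A candidate is accepted only if it is disjoint from every kept span.
        if all(pred["end"] <= k["start"] or pred["start"] >= k["end"] for k in kept):
            kept.append(pred)
    # Return the survivors in document order.
    return sorted(kept, key=lambda p: p["start"])


# Hypothetical predictions for "license number MED-123456";
# the last two spans share the characters "123".
preds = [
    {"entity_group": "ORGANIZATION", "start": 15, "end": 18, "score": 0.9},
    {"entity_group": "ID", "start": 18, "end": 22, "score": 0.6},
    {"entity_group": "ID", "start": 19, "end": 25, "score": 0.8},
]
filtered = drop_overlaps(preds)
```

With these made-up scores, the lower-scoring "-123" span is dropped and the two disjoint spans remain. Other policies (merging overlapping spans, or preferring the longest span) would work as well; the point is only that no two surviving spans share characters.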

@omri374
Contributor

omri374 commented Aug 24, 2023

Thanks! If you could let us know what the issue was, that would be very helpful!

@zizhong
Author

zizhong commented Aug 25, 2023

@omri374 sure thing.
https://huggingface.co/spaces/presidio/presidio_demo/blob/main/transformers_rec/transformers_recognizer.py#L267
Here the predictions can overlap, because chunking uses a text_overlap_length. https://huggingface.co/spaces/presidio/presidio_demo/blob/main/transformers_rec/transformers_recognizer.py#L248

I think that is intended for the use case where only anonymize() is used. However, it becomes a problem if deanonymize() is applied.
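
To illustrate why overlaps only hurt on the way back, here is a toy model in plain Python. It does not use Presidio and is not its implementation; `toy_anonymize`, `toy_deanonymize` and the `<n>` placeholder tokens are made up for this sketch. The assumption is that each span's full original text is stored, while overlapping replacements are clipped in the output, which is consistent with the decrypted items 'MED', '-123' and '123456' above.

```python
def toy_anonymize(text, spans):
    """Replace each (start, end) span with a token, working from the end backwards.

    The full original span text is stored per token, but a span that overlaps
    the previously replaced one is clipped in the output text.
    """
    out = text
    vault = {}
    last_start = len(text)
    ordered = sorted(enumerate(spans), key=lambda item: item[1][0], reverse=True)
    for i, (start, end) in ordered:
        token = f"<{i}>"
        vault[token] = text[start:end]       # stored text ignores the overlap
        clipped_end = min(end, last_start)   # output replacement respects it
        out = out[:start] + token + out[clipped_end:]
        last_start = start
    return out, vault


def toy_deanonymize(masked, vault):
    """Put every stored span text back in place of its token."""
    for token, original in vault.items():
        masked = masked.replace(token, original)
    return masked


text = "license number MED-123456"
# "MED", "-123" and "123456": the last two spans share the characters "123".
spans = [(15, 18), (18, 22), (19, 25)]

masked, vault = toy_anonymize(text, spans)
restored = toy_deanonymize(masked, vault)
```

In this model `masked` looks fine, since every character of the identifier is covered, but `restored` contains "123" twice ("MED-123123456") because both overlapping spans carry it. That matches the symptom in the original report.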
