Skip to content

Commit

Permalink
Updated build scripts
Browse files Browse the repository at this point in the history
  • Loading branch information
ivbeg committed Jun 14, 2024
1 parent ff265b0 commit c1e6634
Show file tree
Hide file tree
Showing 3 changed files with 184 additions and 4 deletions.
180 changes: 180 additions & 0 deletions README.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,180 @@
Metacrafter
===========

Python command line tool and python engine to label table fields and
fields in data files. It could help to find meaningful data in your
tables and data files or to find Personal identifable information (PII).

Installation
------------

To install Python library use ``pip install metacrafter`` via pip or
``python setup.py install``

Features
--------

Metacrafter is a rule based tool that helps to label fields of the
tables in databases. It scans table and finds person names, surnames,
midnames, PII data, basic identifiers like UUID/GUID. These rules
written as .yaml files and could be easily extended.

File formats supported: \* CSV \* JSON lines \* JSON (array of records)
\* BSON \* Parquet \* XML

Databases support: \* Any SQL database supported by
`SQLAlchemy <https://www.sqlalchemy.org/>`__ \* NoSQL databases: \*
MongoDB

Metacrafter key features: \* 111 labeling rules \* all labels metadata
collected into `Metacrafter
registry <https://github.com/apicrafter/metacrafter-registry>`__ public
repository \* 312 date detection rules/patterns, date detection using
`qddate <https://github.com/ivbeg/qddate>`__, “quick and dirty” date
detection library \* extendable set of rules using PyParsing, exact text
match and validation functions \* support any database supported by
SQLAlchemy \* advanced context and language management. You could apply
only rules relevant to certain data of choosen language \* built-in API
server \* commercial support and additional rules available

Command line examples
---------------------

File analysis examples
~~~~~~~~~~~~~~~~~~~~~~

::

# Scan CSV file
$ metacrafter scan-file --format short somefile.csv

# Scan CSV file with delimiter ';' and windows-1251 encoding
$ metacrafter scan-file --format short --encoding windows-1251 --delimiter ';' somefile.csv

# Scan JSON lines file, output results as stats table to file file
$ metacrafter scan-file --format stats -o somefile_result.json somefile.jsonl

Result example of ‘full’ type of formatting

::

key ftype tags matches datatype_url
---------------- ------- ------ --------------------------------------------------------------------- ----------------------------------------------------------
Domain str fqdn 99.90 https://registry.apicrafter.io/datatype/fqdn
Primary domain str fqdn 100.00 https://registry.apicrafter.io/datatype/fqdn
Name str name 100.00 https://registry.apicrafter.io/datatype/name
Domain type str dict
Organization str
Status str dict
Region str dict rusregion 22.95 https://registry.apicrafter.io/datatype/rusregion
GovSystem str dict
HTTP Support str dict boolean 100.00 https://registry.apicrafter.io/datatype/boolean
HTTPS Support str dict boolean 100.00 https://registry.apicrafter.io/datatype/boolean
Statuscode str dict
Is archived str empty
Archives str empty
Archive priority str dict
Archive Strategy str dict
ASN str asn 93.77 https://registry.apicrafter.io/datatype/asn
ASN Country code str dict countrycode_alpha2 100.00,countrycode_alpha2 100.00,languagetag 99.56 https://registry.apicrafter.io/datatype/countrycode_alpha2
IPs str ipv4 96.28 https://registry.apicrafter.io/datatype/ipv4
GovType str dict

Database analysis examples
~~~~~~~~~~~~~~~~~~~~~~~~~~

::

# Scan MongoDB database 'fns', save results as result.json and format output as 'stats'
$ metacrafter scan-mongodb --dbname fns -o result.json -f full

# Scan Postgres database 'dbname', with schema 'public'.
$ metacrafter scan-db --schema public --connstr postgresql+psycopg2://username:[email protected]:15432/dbname

Rules
=====

All rules described as YAML files and by default rules loaded from
directory ‘rules’ or from list of directories provided in .metacrafter
file with YAML format

All rules could be applied to **fields** or **data** .

Compare engines defined in **match** parameter in rule description: \*
text - scan text for exact match to one of text values. Text values
delimited by comma (‘,’) \* ppr - scan text for PyParsing. PyParsing
rule defined as Python code with PyParsing objects like Word(nums,
exact=4) \* func - scan text using Python function provided. Function
shoud accept one string parameter and shoud return True or False

How to write rules
------------------

Function (func)
~~~~~~~~~~~~~~~

Example Russian administrative legal act/law matched by custom function

::

runpabyfunc:
key: runpa
name: Russian legal act / law
maxlen: 500
minlen: 3
priority: 1
match: func
type: data
rule: metacrafter.rules.ru.gov.is_ru_law

Exact text match (text)
~~~~~~~~~~~~~~~~~~~~~~~

Example midname matching by exact field name

::

midname:
key: person_midname
name: Person midname by known
rule: midname,secondname,middlename,mid_name,middle_name
type: field
match: text

PyParsing rule (ppr)
~~~~~~~~~~~~~~~~~~~~

Example Russian cadastral number

::

rukadastr:
key: rukadastr
name: Russian land territory cadastral identifier
rule: Word(nums, min=1, max=2) + Literal(':').suppress() + Word(nums, min=1, max=2) + Literal(':').suppress() + Word(nums, min=6, max=7) + Literal(':').suppress() + Word(nums, min=1, max=6)
maxlen: 20
minlen: 12
priority: 1
match: ppr
type: data

Detailed stats
--------------

Rule types: - field based rules 146 - data based rules 102

Context: - common 47 - companies 15 - crypto 3 - datetime 29 - finances
5 - geo 58 - government 19 - identifiers 3 - industry 2 - internet 18 -
medical 6 - objectids 3 - persons 19 - pii 16 - science 2 - software 1 -
values 1 - vehicles 1

Language: - common 100 - de 4 - en 24 - es 1 - fr 11 - ru 108

Data/time patterns (qddate): 312

Commercial support
------------------

Please write [email protected] or [email protected] to request beta
access to commercial API. Commercial API support 195 fields and data
rules and provided with dedicated support.
2 changes: 1 addition & 1 deletion metacrafter/classify/__init__.py
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
__author__ = "Ivan Begtin"
__version__ = "0.0.3"
__version__ = "0.0.4"
__license__ = "Apache"
__doc__ = "Metacrafter metadata classification tool"
6 changes: 3 additions & 3 deletions setup.py
Original file line number Diff line number Diff line change
Expand Up @@ -70,16 +70,16 @@ def run_tests(self):


def long_description():
with codecs.open('README.rst', encoding='utf8') as f:
with codecs.open('README.md', encoding='utf8') as f:
return f.read()


setup(
name='metacrafter',
version=metacrafter.classify.__version__,
description=metacrafter.classify.__doc__.strip(),
long_description=metacrafter.classify.__doc__.strip(),
# long_description_content_type='text/rst',
long_description=long_description(),
long_description_content_type='text/markdown',
url='https://github.com/apicrafter/metacrafter/',
download_url='https://github.com/apicrafter/metacrafter/',
packages=find_packages(exclude=('tests', 'tests.*')),
Expand Down

0 comments on commit c1e6634

Please sign in to comment.