Data extraction, preprocessing and treatment methods commons
Python3 script with utility functions for preprocessing.
Author: Danielly Sorato
Author contact: danielly.sorato@gmail.com
-
preprocessing.utils.determine_country(filename)
Determines the full name of the country from the ISO country code embedded in the file name.
Parameters: filename (param1) – input file name.
Returns: full name of the country (string).
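The lookup described above could be sketched as follows. This is a minimal illustration, not the module's actual code: the file name layout (`ESS_R01_2002_ENG_GB.xml`, with the country code as the last underscore-separated token), the `COUNTRY_NAMES` table, and the function name `determine_country_sketch` are all assumptions made for the example.

```python
import os

# Hypothetical, partial table of ISO 3166-1 alpha-2 codes.
# The mapping used by the real module is not shown in this documentation.
COUNTRY_NAMES = {
    "DE": "Germany",
    "FR": "France",
    "GB": "United Kingdom",
    "PT": "Portugal",
}


def determine_country_sketch(filename):
    """Return the country name for the code embedded in the file name.

    Assumes the country code is the last underscore-separated token of
    the file name (extension removed). Returns None for unknown codes.
    """
    stem = os.path.splitext(os.path.basename(filename))[0]
    code = stem.split("_")[-1].upper()
    return COUNTRY_NAMES.get(code)
```

Under these assumptions, `determine_country_sketch("ESS_R01_2002_ENG_GB.xml")` yields `"United Kingdom"`, and a file name without a known code yields `None`.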
-
preprocessing.utils.determine_sentence_tokenizer(filename)
Provides the sentence splitter suffix needed to instantiate the splitter for the target language (information embedded in the filename).
Parameters: filename (param1) – input file name.
Returns: a sentence splitter suffix (string) according to the target language.
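One way such a suffix lookup might work is sketched below. The three-letter language token, its position in the file name, the `PUNKT_SUFFIXES` table, and all function names are assumptions made for illustration. The `tokenizers/punkt/<language>.pickle` path follows the classic NLTK Punkt resource layout; whether the module builds its path this way is not confirmed by this documentation.

```python
import os

# Hypothetical mapping from a three-letter language token to the suffix
# of a Punkt model resource. Only a few languages are listed here.
PUNKT_SUFFIXES = {
    "ENG": "english.pickle",
    "FRE": "french.pickle",
    "GER": "german.pickle",
    "POR": "portuguese.pickle",
}


def language_token(filename):
    """Return the language token, assumed to be the second-to-last
    underscore-separated part of the file name, or None if absent."""
    stem = os.path.splitext(os.path.basename(filename))[0]
    parts = stem.split("_")
    return parts[-2].upper() if len(parts) >= 2 else None


def sentence_tokenizer_suffix(filename):
    """Return the Punkt suffix for the file's language, or None."""
    return PUNKT_SUFFIXES.get(language_token(filename))


def punkt_resource_path(filename):
    """Build the resource path that a loader could be given."""
    suffix = sentence_tokenizer_suffix(filename)
    return None if suffix is None else "tokenizers/punkt/" + suffix
```

Keeping the suffix lookup separate from the loading step means the language decision can be checked without NLTK or its model data being installed.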
-
preprocessing.utils.get_sentence_splitter(filename)
Decides which NLTK Punkt sentence tokenizer should be instantiated, according to the information embedded in the filename.
Parameters: filename (param1) – input file name.
Returns: a sentence splitter (NLTK object) instantiated according to the target language.
-
preprocessing.utils.recognize_standard_response_scales(filename, text)
Recognizes special answer categories from EVS by testing the answer segment against the language-dependent pattern definitions for the special categories.
Parameters: - filename (param1) – input file name.
- text (param2) – answer text segment.
Returns: if a pattern is found, a string naming the special category; otherwise None.
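The pattern test described above can be illustrated with a small sketch. The category names (`dontknow`, `refusal`), the regular expressions, the language token position in the file name, and the function name are all hypothetical; the actual EVS pattern definitions are not reproduced in this documentation.

```python
import os
import re

# Hypothetical language-dependent patterns for two special categories.
SPECIAL_PATTERNS = {
    "ENG": {
        "dontknow": re.compile(r"(don'?t|do not) know", re.IGNORECASE),
        "refusal": re.compile(r"refus(al|ed)|no answer", re.IGNORECASE),
    },
    "GER": {
        "dontknow": re.compile(r"wei(ß|ss) nicht", re.IGNORECASE),
        "refusal": re.compile(r"verweigert|keine antwort", re.IGNORECASE),
    },
}


def recognize_special_category(filename, text):
    """Return the special category name matched by text, or None.

    Assumes the language is the second-to-last underscore-separated
    token of the file name, e.g. ENG in EVS_R03_1999_ENG_GB.xml.
    """
    stem = os.path.splitext(os.path.basename(filename))[0]
    parts = stem.split("_")
    language = parts[-2].upper() if len(parts) >= 2 else None
    candidate = text.strip()
    for category, pattern in SPECIAL_PATTERNS.get(language, {}).items():
        # fullmatch: the whole answer segment must be the special category
        if pattern.fullmatch(candidate):
            return category
    return None
```

Matching the whole segment rather than a substring is a design choice of this sketch: it avoids flagging a substantive answer that merely contains a phrase such as "no answer".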
Main method that calls the EVS/ESS scripts to generate MCSQ spreadsheet inputs.
Author: Danielly Sorato
Author contact: danielly.sorato@gmail.com
-
preprocessing.main_xml_files.main(folder_path)
This main file calls the transformation algorithms inside the evs_xml_data_extraction, ess_xml_data_extraction and share_xml_data_extraction scripts.
- evs_xml_data_extraction is called for EVS files
- ess_xml_data_extraction is called for ESS files
- share_xml_data_extraction is called for SHARE files
The algorithm transforms an XML file into a structured spreadsheet format with valuable metadata.
Call main script using folder_path, for instance: reset && python3 main.py /path/to/your/data
Parameters: folder_path (param1) – the path of the directory containing the files to transform.
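The per-study dispatch performed by main could look roughly like the sketch below. It only plans which extraction module would handle each file; it does not call the real scripts. The file name prefixes (`EVS`, `ESS`, `SHA`), the `.xml` filter, and the function names `pick_extractor` and `plan_folder` are assumptions made for the example.

```python
import os


def pick_extractor(filename):
    """Return the name of the extraction module assumed to handle the
    file, based on a hypothetical study prefix in the file name."""
    name = os.path.basename(filename).upper()
    if not name.endswith(".XML"):
        return None  # only XML inputs are considered in this sketch
    if name.startswith("EVS"):
        return "evs_xml_data_extraction"
    if name.startswith("ESS"):
        return "ess_xml_data_extraction"
    if name.startswith("SHA"):
        return "share_xml_data_extraction"
    return None


def plan_folder(folder_path):
    """Map each recognized file in folder_path to its extraction module."""
    plan = {}
    for entry in sorted(os.listdir(folder_path)):
        module = pick_extractor(entry)
        if module is not None:
            plan[entry] = module
    return plan
```

Calling `plan_folder("/path/to/your/data")` returns a dictionary of file names and module names, which a driver could then iterate over to invoke the matching transformation.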