Thanks to David, who kept irritating me because my old xml2csv.sh script used to take 7-8 minutes to generate 13k+ lines of CSV. I went back and rewrote the script, and now it takes around 2 seconds :-D
The script contains validation and comments, but putting those aside, the conversion is actually a one-liner:
tail -n +N filename.xml | sed 's/\(<.\+>\)\(.\+\)\?\(<\/.\+>\)/\2^/g' | tr -d '\n' | sed 's/<\/\?ITEM>/\n/g' | grep -v '^$'
The only things that need to be changed in the previous command are:
- N: The line number to start from, so the first N-1 lines are skipped (-n +N is the form specified by POSIX; the older "tail +N" spelling is obsolete and not accepted by every tail)
- filename.xml: The XML file to convert to CSV
- ^: The delimiter to be used in the output file
- ITEM: The tag that wraps each XML object item, such as in the following:
<ITEM>
<CAR-LICENSE>83838</CAR-LICENSE>
<CAR-MODEL>Ferrari</CAR-MODEL>
<CAR-YEAR>2015</CAR-YEAR>
</ITEM>
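To see what the one-liner does end to end, here is a minimal sketch run against a small made-up file. The file name cars.xml, the XML declaration line and the second item are assumptions for illustration only. It also assumes GNU sed, since \+, \? and \n in the replacement are GNU extensions.

```shell
#!/bin/bash
# Hypothetical sample input: one XML declaration line, then two items.
workdir="$(mktemp -d)"
cat > "${workdir}/cars.xml" <<'EOF'
<?xml version="1.0"?>
<ITEM>
<CAR-LICENSE>83838</CAR-LICENSE>
<CAR-MODEL>Ferrari</CAR-MODEL>
<CAR-YEAR>2015</CAR-YEAR>
</ITEM>
<ITEM>
<CAR-LICENSE>12345</CAR-LICENSE>
<CAR-MODEL>Fiat</CAR-MODEL>
<CAR-YEAR>2012</CAR-YEAR>
</ITEM>
EOF

# Start at line 2 to skip the XML declaration, then run the same pipeline:
#   1st sed: turn each <TAG>value</TAG> line into value^
#   tr:      join everything into one long line
#   2nd sed: turn every <ITEM> or </ITEM> into a newline
#   grep:    drop the empty lines left behind
result="$(tail -n +2 "${workdir}/cars.xml" \
  | sed 's/\(<.\+>\)\(.\+\)\?\(<\/.\+>\)/\2^/g' \
  | tr -d '\n' \
  | sed 's/<\/\?ITEM>/\n/g' \
  | grep -v '^$')"

echo "${result}"
rm -rf "${workdir}"
```

With GNU tools this should print 83838^Ferrari^2015^ and 12345^Fiat^2012^ on two lines. Note that every row ends with a trailing delimiter, because the first sed appends one after each field.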
And this is the long script:
#!/bin/bash
###############################################################################
# Written by Khamis Siksek (Saksoook)
# khamis dot siksek at gmail dot com
# 14 July 2019
# Description:
# A bash script that aims to convert XML input to CSV file
# License:
# This program/script is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see
###############################################################################
# Passed variables
XML_FILENAME="${1}"; #'Filename.xml';
LINES_TO_SKIP="${2:-1}"; # Line number to start reading from (lines before it, e.g. XML meta tags, are skipped); defaults to 1
CSV_FIELDS_FILENAME="${3}"; #'csv_fields.txt';
INITIAL_ITEM_TAG="${4}"; #'ITEMS';
OUTPUT_DELIMITER="${5:-^}"; # Default value ^
# OUTPUT_FILENAME can be empty and therefore GENERATED_NAME will be used
GENERATED_NAME="$(basename "${XML_FILENAME}")_$$.csv";
OUTPUT_FILENAME="${6:-/tmp/${GENERATED_NAME}}"; # Can be empty
# Validation for mandatory fields (using && operator)
[[ -z "${XML_FILENAME}" ]] && echo 'Error: XML_FILENAME cannot be empty' >&2 && exit 1;
[[ -z "${CSV_FIELDS_FILENAME}" ]] && echo 'Error: CSV_FIELDS_FILENAME cannot be empty' >&2 && exit 2;
[[ -z "${INITIAL_ITEM_TAG}" ]] && echo 'Error: INITIAL_ITEM_TAG cannot be empty' >&2 && exit 3;
# Initiate the file with the header line (extracted from the CSV_FIELDS_FILENAME)
HEADER_LINE=$(echo `cat "${CSV_FIELDS_FILENAME}"` | sed s'/ /'"${OUTPUT_DELIMITER}"'/'g);
# The header must end with a newline, otherwise the first data row is appended to the same line
echo "${HEADER_LINE}" > "${OUTPUT_FILENAME}";
#######################################################################################################################
# This is much faster than going through a loop and processing the file line-by-line. I kept the loop version for history
# purposes only. I knew that running the text processing commands on a whole file is faster than doing it line-by-line,
# because such commands are optimized to process big files, while a loop has to read each line and spawn command(s)
# for it. I thought the impact would be a fraction of a second, but it turned out to be really big (2-3 seconds the
# new way versus up to 7-8 minutes the old way).
#
# Now to explain the command (the order of the commands is important):
#
# tail -n +"${LINES_TO_SKIP}" "${XML_FILENAME}": start at the given line, skipping unrelated lines before it (such as xml meta-tags)
# sed 's/\(<.\+>\)\(.\+\)\?\(<\/.\+>\)/\2'"${OUTPUT_DELIMITER}"'/g': remove the xml tags and add the delimiter
# tr -d '\n': remove the newlines from the result, making it one line
# sed 's/<\/\?'"${INITIAL_ITEM_TAG}"'>/\n/g': replace each opening and closing INITIAL_ITEM_TAG with a newline
# grep -v '^$': the previous sed command leaves extra empty lines, so this command removes them
#######################################################################################################################
tail -n +"${LINES_TO_SKIP}" "${XML_FILENAME}" | sed 's/\(<.\+>\)\(.\+\)\?\(<\/.\+>\)/\2'"${OUTPUT_DELIMITER}"'/g' | tr -d '\n' | sed 's/<\/\?'"${INITIAL_ITEM_TAG}"'>/\n/g' | grep -v '^$' >> "${OUTPUT_FILENAME}";
exit 0
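The speed-up described in the script's comments comes from running each tool once over the whole stream instead of once per line. The loop version is not shown in this post, so the following is only a hypothetical reconstruction of that style next to the streaming style, on made-up data. The function names are mine, GNU sed is assumed, and the timings will vary by machine.

```shell
#!/bin/bash
# Hypothetical illustration: per-line processing vs. whole-stream processing.
# Neither function is the original xml2csv.sh code.

# Generate made-up sample lines: <N>1</N>, <N>2</N>, ...
make_input() {
  seq 1 "${1}" | sed 's/.*/<N>&<\/N>/'
}

# Slow style: spawn one sed process for every input line.
strip_tags_loop() {
  while IFS= read -r line; do
    echo "${line}" | sed 's/\(<.\+>\)\(.\+\)\?\(<\/.\+>\)/\2/'
  done
}

# Fast style: a single sed process handles the whole stream.
strip_tags_stream() {
  sed 's/\(<.\+>\)\(.\+\)\?\(<\/.\+>\)/\2/'
}

loop_out="$(make_input 200 | strip_tags_loop)"
stream_out="$(make_input 200 | strip_tags_stream)"

# Both styles should produce identical output; only the cost differs.
[ "${loop_out}" = "${stream_out}" ] && echo "same output"

# Rough wall-clock comparison in whole seconds.
start="$(date +%s)"
make_input 1000 | strip_tags_loop > /dev/null
middle="$(date +%s)"
make_input 1000 | strip_tags_stream > /dev/null
end="$(date +%s)"
echo "loop: $((middle - start))s, stream: $((end - middle))s"
```

The gap grows with the number of lines, since the loop pays the process start-up cost once per line while the streaming version pays it once in total.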