Sunday, July 14, 2019

xml2csv and csv2xml

I was working on a request to convert an XML file to CSV file, I tried to find a tool or a script to do that but I didn't find something that is straightforward.

There might be one, but to be honest I didn't search for a long time and when I found this https://github.com/dfellis/2csv and installed xml2 on ubuntu that is written by Dan Egnor, I decided to ignore finding a ready tool and thought of writing my own.

This is not the first time I write my own script but this is the first time I thought of sharing it and get some feedback and ideas to improve it and I hope this will not be the last (depends on the feedback). So here you go:

xml2csv.sh

#!/bin/bash
###############################################################################
#    Written by Khamis Siksek (Saksoook)
#    khamis dot siksek at gmail dot com
#    14 July 2019
# Description:
#    A bash script that aims to convert XML input to CSV file
# License:
#    This program/script is free software: you can redistribute it and/or modify
#    it under the terms of the GNU General Public License as published by
#    the Free Software Foundation, either version 3 of the License, or
#    (at your option) any later version.
#
#    This program is distributed in the hope that it will be useful,
#    but WITHOUT ANY WARRANTY; without even the implied warranty of
#    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
#    GNU General Public License for more details.
#
#    You should have received a copy of the GNU General Public License
#    along with this program.  If not, see .
###############################################################################

# Passed variables
XML_FILENAME="${1}"; #'Filename.xml';
LINES_TO_SKIP="${2}"; # If there are XML meta tags that do not contain data
CSV_FIELDS_FILENAME="${3}"; #'csv_fields.txt';
INITIAL_ITEM_TAG="${4}"; #'ITEMS';
OUTPUT_DELIMITER="${5:-^}"; # Default value ^

# OUTPUT_FILENAME can be empty and therefore GENERATED_NAME will be used
GENERATED_NAME="$(basename "${XML_FILENAME}")_$$.csv";
OUTPUT_FILENAME="${6:-/tmp/${GENERATED_NAME}}"; # Can be empty

# Validation for mandatory fields (using && operator)
[[ -z "${XML_FILENAME}" ]] && echo 'Error: XML_FILENAME cannot be empty' && exit -1;
[[ -z "${CSV_FIELDS_FILENAME}" ]] && echo 'Error: CSV_FIELDS_FILENAME cannot be empty' && exit -2;
[[ -z "${INITIAL_ITEM_TAG}" ]] && echo 'Error: INITIAL_ITEM_TAG cannot be empty' && exit -3;

# Initiate the file with the header line (extracted from the CSV_FIELDS_FILENAME)
HEADER_LINE=$(echo `cat "${CSV_FIELDS_FILENAME}"` | sed s'/ /'"${OUTPUT_DELIMITER}"'/'g);
echo "${HEADER_LINE}" > "${OUTPUT_FILENAME}";

# +2 for the INITIAL_ITEM_TAG OPENING AND CLOSING and
XML_FIELDS_COUNT="$(($(cat "${CSV_FIELDS_FILENAME}" | wc -l)+2))";

# Counting how many items are in the file
ITEMS_COUNT=$(cat "${XML_FILENAME}" | grep '<'"${INITIAL_ITEM_TAG}"'>' | wc -l);

LINES_OFFSET=0;
declare -a XML_ITEMS;

for ((i=1;i<=${ITEMS_COUNT};i++)); do
{
    # tail +LINES_TO_SKIP skip the first line (Header Line)
    # The sed command is used to add _ to the values so they do not get separated when
    # building the array as the read -a XML_ITEMS uses the SPACE as the separator
    # +${LINES_OFFSET} is needed to skip the read ITEM and go to the next item
    read -a XML_ITEMS <<< $(echo `cat "${XML_FILENAME}" | tail +$((${LINES_TO_SKIP}+${LINES_OFFSET})) | head -"${XML_FIELDS_COUNT}" | sed
 s'/[[:space:]]\+/_/'g`);

    for ((j=1;j<$((${XML_FIELDS_COUNT}-1));j++)); do
    {
        # The first sed command isolates the value from the tags (\2)
        # The second sed command replaces the _ with space (this _ was added in a previous sed command)
        echo -n "${XML_ITEMS[$j]}${OUTPUT_DELIMITER}" | sed s'/\(<.\+>\)\(.\+\)\?\(<\/.\+>\)/\2/'g | sed s'/_/ /'g >> "${OUTPUT_FILEN
AME}";
    }
    done

    # Adding a new line at the end of each csv record
    echo >> "${OUTPUT_FILENAME}";

    # As stated before this calculates the next starting point for the next item
    LINES_OFFSET=$((LINES_OFFSET+XML_FIELDS_COUNT));
    #echo "${LINES_OFFSET}";
}
done

exit 0;


-------------------------------------------------------------------


csv2xml.sh

#!/bin/bash
###############################################################################
#    Written by Khamis Siksek (Saksoook)
#    khamis dot siksek at gmail dot com
#    14 July 2019
# Description:
#    A bash script that aims to convert XML input to CSV file
# License:
#    This program/script is free software: you can redistribute it and/or modify
#    it under the terms of the GNU General Public License as published by
#    the Free Software Foundation, either version 3 of the License, or
#    (at your option) any later version.
#
#    This program is distributed in the hope that it will be useful,
#    but WITHOUT ANY WARRANTY; without even the implied warranty of
#    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
#    GNU General Public License for more details.
#
#    You should have received a copy of the GNU General Public License
#    along with this program.  If not, see .
###############################################################################

# Passed variables
CSV_FILENAME="${1}"; #'Filename.csv';
INITIAL_ITEM_TAG="${2}"; #'ITEMS';
INPUT_DELIMITER="${3:-^}"; # Default value ^

# OUTPUT_FILENAME can be empty and therefore GENERATED_NAME will be used
GENERATED_NAME="$(basename "${CSV_FILENAME}")_$$.xml";
OUTPUT_FILENAME="${4:-/tmp/${GENERATED_NAME}}";

# Validation for mandatory fields (using && operator)
[[ -z "${CSV_FILENAME}" ]] && echo 'Error: CSV_FILENAME cannot be empty' && exit -1;
[[ -z "${INITIAL_ITEM_TAG}" ]] && echo 'Error: INITIAL_ITEM_TAG cannot be empty' && exit -3;

# Initiate the file with the header line (extracted from the CSV_FIELDS_FILENAME)
HEADER_LINE=$(head -1 "${CSV_FILENAME}");

# Array to contain the XML items tags
declare -a XML_ITEMS;
read -a XML_ITEMS <<< $(echo "${HEADER_LINE//${INPUT_DELIMITER}/ }");

#echo "${#XML_ITEMS[@]}"; # Debugging

> "${OUTPUT_FILENAME}";

# tail +2 to skip the header line
cat "${CSV_FILENAME}" | tail +2 | while read CSV_LINE; do
{
    # Start the item xml tag
    echo '<'"${INITIAL_ITEM_TAG}"'>' >> "${OUTPUT_FILENAME}";

    # For the read command to use the INPUT_DELIMITER as the delimiter
    OIFS="${IFS}";
    IFS="${INPUT_DELIMITER}";
    read -a CSV_LINE_ARR <<< $(echo "${CSV_LINE}");
    IFS="${OIFS}";

    for ((i=0;i<"${#XML_ITEMS[@]}";i++)); do
    {
        # Building the XML tags
        echo -n '    <'"${XML_ITEMS[$i]}"'>' >> "${OUTPUT_FILENAME}";
        echo -n "${CSV_LINE_ARR[$i]}" >> "${OUTPUT_FILENAME}";
        echo -n '</'"${XML_ITEMS[$i]}"'>' >> "${OUTPUT_FILENAME}";
        echo >> "${OUTPUT_FILENAME}";
    }
    done

    # End the item xml tag
    echo '</'"${INITIAL_ITEM_TAG}"'>' >> "${OUTPUT_FILENAME}";
}
done

exit 0;






-------------------------------------------------------------------

To run them (of course they have to be executable):

csv2xml.sh CSV_FILENAME ITEM_START_TAG DELIMITER OUTPUT_FILENAME

xml2csv.sh XML_FILENAME LINES_TO_SKIP CSV_FIELDS_FILE ITEM_START_TAG DELIMITER OUTPUT_FILENAME

Let us assume that you have this example.xml file:

<ITEM>
<ITEM-CODE>1</ITEM-CODE>
<TRANSACTION-DATE>20171207121400</TRANSACTION-DATE>
<ITEM-TYPE>Bus</ITEM-TYPE>
<ORDERED-BY>Customer1</ORDERED-BY>
<ORDER-DATE>20181207111600</ORDER-DATE>
<ORDER-ID>4190126</ORDER-ID>
<CUSTOMER-LOCATION>Egypt</CUSTOMER-LOCATION>
</ITEM>
<ITEM>
<ITEM-CODE>5</ITEM-CODE>
<TRANSACTION-DATE>20161207111400</TRANSACTION-DATE>
<ITEM-TYPE>Car</ITEM-TYPE>
<ORDERED-BY>Customer2</ORDERED-BY>
<ORDER-DATE>20161207111400</ORDER-DATE>
<ORDER-ID>3190996</ORDER-ID>
<CUSTOMER-LOCATION>Jordan</CUSTOMER-LOCATION>
</ITEM>



You will need to generate the example_fields.txt (containing the XML start tags without the <>) to be as the following:

more example_fields.txt

ITEM-CODE
TRANSACTION-DATE
ITEM-TYPE
ORDERED-BY
ORDER-DATE
ORDER-ID
CUSTOMER-LOCATION


If you run xml2csv.sh example.xml 1 example_fields.txt 'ITEM' '^'

You will get

more /tmp/example.xml_14799.csv
 

ITEM-CODE^TRANSACTION-DATE^ITEM-TYPE^ORDERED-BY^ORDER-DATE^ORDER-ID^CUSTOMER-LOCATION
1^20171207121400^Bus^Customer1^20181207111600^4190126^Egypt^
5^20161207111400^Car^Customer2^20161207111400^3190996^Jordan^


when you run csv2xml.sh on the generated file as the following:
bash csv2xml.sh /tmp/example.xml_14799.csv 'ITEM' '^'

You will get

more /tmp/example.xml_14799.csv_14884.xml

<ITEM>
    <ITEM-CODE>1</ITEM-CODE>
    <TRANSACTION-DATE>20171207121400</TRANSACTION-DATE>
    <ITEM-TYPE>Bus</ITEM-TYPE>
    <ORDERED-BY>Customer1</ORDERED-BY>
    <ORDER-DATE>20181207111600</ORDER-DATE>
    <ORDER-ID>4190126</ORDER-ID>
    <CUSTOMER-LOCATION>Egypt</CUSTOMER-LOCATION>
</ITEM>
<ITEM>
    <ITEM-CODE>5</ITEM-CODE>
    <TRANSACTION-DATE>20161207111400</TRANSACTION-DATE>
    <ITEM-TYPE>Car</ITEM-TYPE>
    <ORDERED-BY>Customer2</ORDERED-BY>
    <ORDER-DATE>20161207111400</ORDER-DATE>
    <ORDER-ID>3190996</ORDER-ID>
    <CUSTOMER-LOCATION>Jordan</CUSTOMER-LOCATION>
</ITEM>


7 comments:

David said...

God I love AWK


here is my version of xml2cvs

```
#Copyright
#
#Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
#
#1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
#
#2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
#
#3. Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
#
#THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

BEGIN {
RS=">"; FS="</"; isHeaderAlPrnt = 0; wth="ITEM"; pa=0;
}

function plist(list , type) {
str = ""
for (x in list) {
k = x;
if (type == 1)
k = list[x];
str = str sprintf("%s%s", k, sp);
}
print substr(str, 0, length(str))
}

isHeaderAlPrnt == 0 {
if ($0 ~ sprintf("<\/%s$", wth)) {
plist(header, 0);
isHeaderAlPrnt = 1;
}
}

{
if ($0 ~ sprintf("<\/%s$", wth)) {
plist(header, 1);
delete header
}
}
{
if ($0 ~ sprintf("<\/%s$", wth)) {
st=0;
} else if ($0 ~ sprintf("<%s$", wth)) {
st=1;
}
}

st == 1 {
if ( NF == 2 ) {
header[$2] = $1;
}
}

```

Usage
```
$ awk -vwth="ITEM" -vsp="," -f xml2cvs.awk CVS-FILE-NAME.cvs
```
where ITEM is the XML element which you want to capture children value.

Same code can be shorter with Perl.


david said...

Compact size perl script
```
use v5.010;

my $match = $ARGV[0];
print $match;
my $st = 0;
my $isHdPrinted = 0;
my %value;
while () {
chomp;
if ($_ =~ /<$match>/ || $_ =~ /<\/$match>/) {
$st = !$st;
my @ksorted = sort(keys(%value));
if (scalar @ksorted != 0) {
say join(',', @ksorted) if !$isHdPrinted;
my @vl;
foreach my $key (@ksorted) {
push @vl, $value{$key};
}
say join(',', @vl);
$isHdPrinted = 1;
}
undef %value;
next;
}
my ($ig, $name, $value) = split(/>|</);
$value{$name} = $value if $st;
}

```
Usage
```
$ cat XML-FILE-NAME | perl -f xml2cvs.pl ITEM
```

saksoook said...

Thanks David for passing by my blog and for providing your way of writing such a tool.

By the way, I love bash I also love awk command though I do not use it frequently.

My problem (and I am jealous) is that the awk script is super fast compared to bash script (less then second compared to around 9 minutes when processing around 13,000 record). I will have to find a way to make the bash script faster :-)

david said...

Thank SakSoook, Glad you like it.

saksoook said...

David i have found a one liner to do the xml2csv conversion please check my new post :-)

david said...

Cool but the one liner solution you provided ignore the following:
1- Items can have a different order inside the XML, current Perl script sort by header name to avoid that.
2- It doesn't generate the headers.
3- It doesn't scope down to specific XML element. Let us say you need to ignore what is inside X Element and Capture what is inside Y element.







Here is a Perl one liner

$ cat sample.xml | perl -F'[>|<]' -E 'say join(",", @list) and @list = () if $F[1] =~ "/ITEM"; push(@list, $F[2]) if length($F[2]) > 1;' > result.csv

real 0m1.534s
user 0m1.512s
sys 0m0.031s

$ time cat sample.xml | sed s'/\(<.\+>\)\(.\+\)\?\(<\/.\+>\)/\2^/'g | tr -d '\n' | sed s'/<\/\?ITEM>/\n/'g | grep -v '^$' > shu.csv

real 0m4.068s
user 0m4.108s
sys 0m0.137s

You can get 2.5 time speed up if you use the Perl one liner. Enjoy!

saksoook said...

I hate you :-) but the best I could do is cheating like this:

head -"$(grep -n -m1 '' sample.xml | cut -d: -f1)" sample.xml | grep -v '<\/\?ITEM>' | cut -d'/' -f2 | tr -s '>\n' '^' && sed s'/.\+>\(.\+\)\?<.\+$/\1^/'g sample.xml | tr -d '\n' | sed s'/<\/\?ITEM>/\n/'g | grep -v '^$'

The previous command gets the header line. And it is faster due to the fact that I simplified the regular expression in the sed command.

The generation of the CSV file assumes that the items are always ordered the same way and to remove an item from the XML it can be done by grep -v or from the CSV file by removing the column.