.. _wiki-data:

Working with Demography of the United Kingdom data
==================================================

In this use case, we will see how to load a table from a URL, and choose between
multiple tables that might be present at the URL. Once we successfully load the
table into the Data Import Tool, we will proceed to look at how we can change
the index of the generated Data Frame. Finally, we will take a peek at the
Python-Pandas code generated by the Tool, which we can use to replicate our
work.

The Data Set
------------

The data used here comes from the wiki page
`Demography of the United Kingdom <https://en.wikipedia.org/w/index.php?title=Demography_of_the_United_Kingdom&oldid=742088706>`_.
From this URL/dataset, we are interested in the 5th table, on
*Vital Statistics since 1960*. On visual inspection of the table, it is obvious
that the first column contains years. But, the column name field is empty,
which will prompt the Data Import Tool to assign a default column name
`Unnamed: 0`.

Columns from table::

    (Year)
    Average population (in thousands)
    Live births
    Deaths
    Natural change
    Crude birth rate (per 1,000)
    Crude death rate (per 1,000)
    Natural change (per 1,000)
    Fertility rates
    Population density


Step 1 : Opening the Data Set
-----------------------------

See :ref:`Launching the Data Import Tool <launching>` to understand how to pass
a URL to the Data Import Tool, from which we can load a table.
The Data Import Tool, by default, opens the very first table on the wiki page.
Instead, let's say we are interested in the 5th table.

Step 2 : Choosing a table
-------------------------

In order to load the the 5th table on the page, we can change the value of the
*Table index on page* to 4 and click *Refresh Data* (Note that counting starts
from zero.)

We can also see that the first row in the table seems to contain
Column Names. By checking the *First row is column name* checkbox and clicking
on *Refresh Data*, we can reload the dataset with the Column Headers. We
also have the option to use the :ref:`Promote Row to Header <promote_row>`
command to perform the same operation.

Refer to :ref:`read_data_command` for more information on the initial read-in.

.. figure:: /images/use_cases/wiki_data/read_html.png
    :align: center
    :figwidth: image


Step 3 : Renaming a Column Header and Setting the Index Column
--------------------------------------------------------------

Given the form of the data, it would be nice to use the 0th column, containing
years, as the Index of the DataFrame. It is currently named ``Unnamed: 0``
because the Tool autonames columns if nothing is specified. We can now
rename the column header by right-clicking on the column header and selecting
the ``Rename`` option.

We can set any column as the Index of the Data Frame by right clicking on the
column name and clicking the *Set as Index* option. For more, see
:ref:`Setting Index<set_index>`.

.. figure:: /images/use_cases/wiki_data/set_index.png
    :align: center
    :figwidth: image

Step 4 : Exporting generated Python Code
----------------------------------------

The Data Import Tool aims to provide a reproducible way of manipulating data
sets. Towards this goal, the Tool generates Python-Pandas code that we can use
to reproduce all of the operations we performs on a file. See
:ref:`export_script` for more information on exporting the generated Python
script. In this particular case, the generated code is :

.. code:: python

    # -*- coding: utf-8 -*-

    import pandas as pd

    # Pandas version check
    from pkg_resources import parse_version
    if parse_version(pd.__version__) != parse_version(u'0.18.1'):
        raise RuntimeError('Invalid pandas version')

    from catalyst.pandas.headers import get_clean_names
    import re

    # Read HTML from Demography_of_the_United_Kingdom
    url= 'https://en.wikipedia.org/wiki/Demography_of_the_United_Kingdom'
    data_frame = pd.read_html(url, header=0, skiprows=0,
                              match='.+', attrs=re.compile('wikitable'))[4]

    # Ensure stripping and uniqueness of column names
    data_frame.columns = get_clean_names(data_frame.columns)

    # Set Unnamed: 0 as the table index
    data_frame.set_index(u'Unnamed: 0', inplace=True)

As we can see from the generated code, ``pandas.read_html`` generates a list
of tables from the URL, from which we chose the ``[4]`` th table.
The additional parameter ``header=0`` tells pandas to use the ``0`` th row in
the table as column names.

Secondly, the call to ``data_frame.set_index`` sets the column ``Unnamed: 0``
to be the Index of the DataFrame.