.. _overview: Data Import Tool Overview ========================= .. image:: ../images/user_guide/canopy-data-import-tool-icon-white-removed-01-250x205.png :align: center The Canopy Data Import Tool allows you to import and manipulate text data files in an easy and reproducible way. It is built on top of Pandas, providing an exploratory and graphical interface to data manipulation. After you have manipulated the data, you can take control of the underlying DataFrame from the IPython console in the Canopy Editor. You can also export your command history as a Python script so that you or any of your colleagues can perform the same set of manipulations and reproduce your results. Apart from these docs, you can read more about the Canopy Data Import Tool, look up known issues, and provide feedback through the `Enthought Knowledge Base `_. You can also write to us at `canopy.support@enthought.com `_ if you would like to provide feedback or report a bug. .. _benefits: Benefits of the Data Import Tool -------------------------------- 1. Easily import your data from structured text files, URLs containing embedded tables, or from your clipboard 2. View and manipulate data in the Pandas DataFrame while simultaneously capturing the corresponding Python code 3. Create re-usable recipes for common data munging tasks to expedite future data cleanup For a quick demo of the Data Import Tool, please see the video |demo link|. .. |demo link| raw:: html Enthought Canopy Data Import Tool: CSV & More to Python Pandas DataFrames Reduce time spent on data analysis ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Whether you are a data scientist, a quantitative analyst, an engineer, or evaluating consumer purchase behavior, stock portfolios, or design simulation results, your data analysis workflow probably looks a lot like this: **Acquire** -> **Wrangle** -> Analyze and Model -> Share and Refine -> Publish The problem is that often 50 to 80 percent of time is spent wading through the tedium of the first two steps – acquiring and wrangling data – before even getting to the real work of analysis and insight. The Canopy Data Import Tool can significantly reduce the time you spend on data analysis “dirty work,” by helping you: * Load various data file types and URLs containing embedded tables into Pandas DataFrames * Perform data munging tasks that improve raw data * Handle complicated or messy data * Extend the work done with the tool, to other data files For a detailed tutorial on speeding up common data analysis workflows with the Canopy Data Import Tool, please see the Webinar |webinar link|. .. |webinar link| raw:: html Fast Forward Through Data Munging with Python Data Import and Manipulation Tool .. _launching: Launching --------- There are multiple ways to launch the Data Import Tool: * From the Canopy Editor, click *File --> Import Data* and select the data source you'd like to use -- a file, a URL, or the clipboard .. figure:: /images/user_guide/import-data-canopy-menu.png :align: center :figwidth: image * From the Canopy Editor Toolbar icon, click and select the data source you'd like to use -- a file, a URL, or the clipboard .. figure:: /images/user_guide/import-data-canopy-toolbar-menu.png :align: center :figwidth: image * Right-click data files in the File Browser and select *Import Data* .. figure:: /images/user_guide/import-data-menu.png :align: center :figwidth: image * From within a Jupyter Notebook, click the Data Import Tool icon in the tool bar. .. figure:: /images/user_guide/jupyter-dit-button.png :align: center :figwidth: image .. note:: The Data Import Tool will warn you if you are trying to open a file larger than 70MB in size. While the Tool can open the file and manipulate it, it can be time consuming. At the moment, we suggest you create a smaller data file with a subset of the original data for a more responsive preview. Quick Start Example ------------------- Now that you know how to launch the tool, continue reading for more details about the interface or jump right into the example use cases with :ref:`MLB batting data` and :ref:`Wind data` to see how various commands can be applied and the results they produce. Interface --------- The interface consists of five main components (see :ref:`main-interface-img`): * The :ref:`main_view` has three tabs. The :ref:`data_frame_view` is used to view and manipulate the data. The :ref:`raw_data_view` is a read-only view used to see the raw data being imported. Finally, the :ref:`python_code_view` is used to view the generated Python code representing manipulations. * The :ref:`command_history_pane` lists the manipulations you've performed. Click on a command to select it. * The :ref:`command_editor_pane` is used to edit the currently selected command. * The :ref:`configuration_pane` can be used to change the name of the resulting DataFrame. This pane is hidden by default but can be toggled on/off via the View menu. * The :ref:`log_view_pane` provides detailed messages regarding actions the tool is taking as you try out various commands, such as alerting you that a column's data type was converted or if the tool detected a header line. This pane is hidden by default but can be toggled on/off via the View menu. For more information about the various components, see :ref:`interface`. .. _main-interface-img: .. figure:: /images/user_guide/interface.png :align: center :figwidth: image Figure: Main Interface Commands -------- A command refers to a single data manipulation task -- from renaming a column to deleting rows based on a condition. Every command is completely reversible, so don't be afraid to try them! As commands are executed they are logged in the command history. This allows you to see the exact steps that have been taken while you transform your data. Commands in the command history can be enabled, disabled, or removed. These operations can be done to a command at any point in the history, and the other commands in the history will be reverted and re- executed accordingly to ensure that all commands are performed correctly. **Type Inference** The tool performs type inference on all the columns in your DataFrame. The current type of each column can be seen in a tooltip box when hovering over the column header. Additionally, columns can be converted to many other types. See :ref:`Convert Column` for more information on how. .. note:: For data files with more than 250 columns, the Tool doesn't perform automatic type inference and conversion. In this case, the user is expected to manually convert the columns to the type they wish by using the :ref:`Convert Column` command on their column of choice. Accessing the DataFrame ----------------------- When you are finished manipulating your data, you can click *Use DataFrame* to inject the DataFrame into your Canopy IPython namespace. The name of the DataFrame is set automatically to the filename, which you can change in the :ref:`configuration_pane`. Once you are back in Canopy's IPython console, you should have access to the ``view`` function that can be called with the DataFrame as the argument. This will launch our viewer for easily seeing the full data set. Clicking on the ``View on Close`` checkbox will also launch our viewer after you load the DataFrame into the IPython console. .. _export_script: Exporting your commands to a script ----------------------------------- At any point in the process of manipulating the DataFrame, the user can save the executed commands to a Python script with *Export Code --> To Pandas Script*. .. note:: After the user clicks on *Use Dataframe*, the Data Import Tool automatically saves the commands to a Python script. By default, the scripts are saved in the `data_import_tool/autosaved_scripts` directory in your home directory. You can change this default location in the Data Import Tool ``Preferences`` pane. The ``Preferences`` pane can be accessed from the Data Import Tool Menu bar. This location will also be visible from the `Canopy File Browser`. Saving your DataFrame --------------------- The Tool allows you to save the DataFrame in multiple formats, accessed through *Save --> Save DataFrame --> CSV/Excel File*. .. _save_df_img: .. figure:: /images/user_guide/save_df.png :align: center :figwidth: image Figure: Saving the DataFrame Saving your Command History --------------------------- After a successful import, the Tool also saves the commands from your :ref:`Command History` to a file, unique to each data source. When you load the data source again, the Tool will automatically detect the saved file containing a set of commands and applies them on top of the original data frame. This way, you can start where you left off. The Tool also loads the saved command history when loading data from similarly named files or URLs. For example, commands performed and saved on a data file ``mlb_batting_2008.csv`` will be automatically applied when the file ``mlb_batting_2009.csv`` is loaded.