# Supported Plugins
The idea behind plugins is very simple: a plugin is a piece of code which extracts data from (or exports data to) a specific format and interfaces with the core library functions, which know how to detect and convert data types, export to other formats etc. If you don't find a plugin for the format you need, feel free to contribute. :-)
Each `import_from_X` function receives format-specific parameters (depending on the format you're working with) but also general parameters such as `skip_header` and `fields` (these are passed to the `rows.plugins.utils.create_table` function). Some plugins also provide helper functions to work with the specific format, which can help a lot when extracting non-tabular data (like `rows.plugins.html.extract_links` and `rows.plugins.pdf.pdf_to_text`).
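As an illustration, a minimal sketch of passing these general parameters through a plugin (the file name, column names and the exact behaviour of `fields`/`skip_header` are assumptions - check `rows.plugins.utils.create_table` for the authoritative signature):

```python
from collections import OrderedDict

import rows
from rows import fields

# Hypothetical CSV file: force the column types instead of letting the
# library detect them (assumed behaviour of the `fields` parameter).
table = rows.import_from_csv(
    "people.csv",
    fields=OrderedDict([("name", fields.TextField), ("age", fields.IntegerField)]),
    skip_header=True,  # assumed: treat the first row as a header and drop it
)
rows.export_to_csv(table, "people-normalized.csv")
```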
This documentation is still in progress - please look into the plugins' source code to see all available parameters. Contributions to the documentation are very welcome. Look into the `examples` folder to see the plugins in action. :)
Currently implemented plugins: CSV, list of dicts, HTML, JSON, ODS, Parquet, PDF, PostgreSQL, SQLite, TXT, XLS, XLSX and XPath (see the sections below).
Note: `rows` is still not lazy by default, except for some operations like `csv2sqlite`, `sqlite2csv`, `pgimport` and `pgexport` (so using `rows.import_from_X` will put everything in memory); we're working on this.
## CSV

Use `rows.import_from_csv` and `rows.export_to_csv` (dependencies are installed by default). The CSV dialect is detected automatically, but you can specify it by passing the `dialect` parameter.
Helper functions:
- `rows.plugins.csv.discover_dialect`: tries to figure out the CSV dialect based on a sample (in bytes).
- `rows.utils.csv2sqlite`: lazily convert a CSV into a SQLite table (the command-line version of this function is pretty useful -- see more by running `rows csv2sqlite --help`). The CSV can be optionally compressed (`.csv`, `.csv.gz` and `.csv.xz`).
Learn by example:
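A minimal sketch of the CSV round trip (the file name and column names are hypothetical):

```python
import rows

# The dialect is detected automatically; use the `dialect` parameter to override it.
table = rows.import_from_csv("cities.csv")

for row in table:  # each row behaves like a namedtuple
    print(row.name, row.population)

rows.export_to_csv(table, "cities-copy.csv")
```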
## List of dicts

Use `rows.import_from_dicts` and `rows.export_to_dicts` (no dependencies). Useful when you have the data in memory and would like to detect/convert data types and/or export to a supported format.
Learn by example:
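A minimal sketch (the data is made up for illustration):

```python
import rows

data = [
    {"name": "Rio de Janeiro", "population": 6747815},
    {"name": "São Paulo", "population": 12252023},
]

table = rows.import_from_dicts(data)   # field types are detected automatically
print(table.fields)                    # name -> TextField, population -> IntegerField

# Back to a list of dicts (or export to any other supported format)
print(rows.export_to_dicts(table))
```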
## HTML

Use `rows.import_from_html` and `rows.export_to_html` (dependencies must be installed with `pip install rows[html]`). You can specify the table index in case there's more than one `<table>` inside the HTML, decide whether to keep the HTML code inside the `<td>` tags (useful to extract links and "hidden" data) and other options. Very useful in Web scraping.
Learn by example:
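A minimal sketch (the file name, column name and the `index`/`preserve_html` parameter names are assumptions - check the plugin source for the exact signature):

```python
import rows
from rows.plugins.html import extract_links

# `index` selects which <table> to read; `preserve_html` keeps the raw HTML
# inside each cell (both parameter names assumed here).
table = rows.import_from_html("cities.html", index=0, preserve_html=True)

for row in table:
    # With the HTML preserved we can pull the links out of a cell
    print(extract_links(row.city))
```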
Helper functions:
- `rows.plugins.html.count_tables`: return the number of tables for a given HTML;
- `rows.plugins.html.tag_to_dict`: extract a tag's attributes into a `dict`;
- `rows.plugins.html.extract_text`: extract the text content from a given HTML;
- `rows.plugins.html.extract_links`: extract the `href` attributes from a given HTML (returns a list of strings).
## JSON

Use `rows.import_from_json` and `rows.export_to_json` (no dependencies). Each table is converted to an array of objects (where each row is represented by an object).
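A minimal sketch (the file names are hypothetical):

```python
import rows

table = rows.import_from_json("cities.json")    # expects an array of objects
rows.export_to_json(table, "cities-copy.json")
```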
## ODS

Use `rows.import_from_ods` (dependencies must be installed with `pip install rows[ods]`).
## Parquet

Use `rows.import_from_parquet` passing the filename (dependencies must be installed with `pip install rows[parquet]`; if the data is compressed using snappy you'll also need `pip install rows[parquet-snappy]` and the `libsnappy-dev` system library) -- read this blog post for more details and one example.
## PDF

Use `rows.import_from_pdf` (dependencies must be installed with `pip install rows[pdf]`).
### PDF Parser Backend

There are two available backends (under-the-hood libraries to parse the PDF), which you can select by passing the `backend` parameter (results may differ depending on the backend):

- `'pymupdf'`: use it if possible; it's much faster than the other option;
- `'pdfminer'`: 100% Python implementation, very slow.
Get this list programmatically with `rows.plugins.pdf.backends()`. You can also subclass `rows.plugins.pdf.PDFBackend` and implement your own PDF parser, if needed.
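A minimal sketch of choosing a backend (the PDF file name is hypothetical):

```python
import rows
from rows.plugins import pdf

print(pdf.backends())  # backends available in this installation

# Prefer 'pymupdf' when it's installed; results may differ between backends.
table = rows.import_from_pdf("report.pdf", backend="pymupdf")
```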
### Specify Table Boundaries
You can specify some parameters to delimit where the table is located in the PDF, like:
- `starts_after` and `ends_before`: delimit the objects before/after the table. Can be: regular strings (exact match), regular expression objects, or functions (which receive the object and must return `True` for the object that defines where the table starts/ends);
- `page_numbers`: sequence with the desired page numbers (starts from `1`).
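A minimal sketch combining these parameters (the strings and page numbers are hypothetical):

```python
import re

import rows

table = rows.import_from_pdf(
    "report.pdf",
    page_numbers=(1, 2),                        # only the first two pages
    starts_after=re.compile(r"Table 1\."),      # regexp: table starts after this object
    ends_before="Source: hypothetical source",  # exact match: table ends before this object
)
```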
### Specify Detection Algorithms

There are 3 available algorithms to identify text objects and define where the table is located inside each page - you can subclass them and overwrite some methods to have custom behaviour (like `get_lines`, where you can access the objects' positions, for example). The algorithms available are (get the list programmatically with `rows.plugins.pdf.algorithms()`):
- `rows.plugins.pdf.YGroupsAlgorithm`: default, group text objects by y position and identify table lines based on these groups;
- `rows.plugins.pdf.HeaderPositionAlgorithm`: use the table header to identify cell positions and then fill the table with the found objects (useful in sparse tables);
- `rows.plugins.pdf.RectsBoundariesAlgorithm`: detect the table boundaries by the rectangles on the page (currently only available using the `'pdfminer'` backend, which is very slow).
### Helper Functions

- `rows.plugins.pdf.number_of_pages`: returns an integer representing the number of pages of a specific PDF file/stream;
- `rows.plugins.pdf.pdf_to_text`: generator: each iteration will return the text for a specific page (you can specify `page_numbers` to delimit which pages will be returned);
- `rows.plugins.pdf.pdf_table_lines`: almost the same as `rows.import_from_pdf`, but returns a list of strings instead of a `rows.Table` object. Useful if the PDF is not well structured and needs some tweaking before importing as a `rows.Table` (so you can export to another format).
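A minimal sketch of the helpers (the file name is hypothetical):

```python
from rows.plugins import pdf

print(pdf.number_of_pages("report.pdf"))

# pdf_to_text is a generator: one string per page
for page_text in pdf.pdf_to_text("report.pdf", page_numbers=(1,)):
    print(page_text)
```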
### Examples

- `balneabilidade-brasil`: downloads thousands of PDFs from Brazilian organizations which monitor water quality, then extracts the tables in each PDF and puts all rows together in one CSV;
- `examples/cli/extract-pdf.sh`: PDF extraction using the command-line interface (the parameters cannot be customized using this method by now -- more improvements in next versions).
## PostgreSQL

Use `rows.import_from_postgresql` and `rows.export_to_postgresql` (dependencies must be installed with `pip install rows[postgresql]`).
### Parameters

On both `rows.import_from_postgresql` and `rows.export_to_postgresql` you can pass either a connection string or a `psycopg2` connection object.

On `rows.import_from_postgresql` you can pass a `query` parameter instead of a `table_name`.
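A minimal sketch (the connection URI, table and query are hypothetical):

```python
import rows

uri = "postgres://user:password@localhost/mydatabase"

# Import a whole table...
cities = rows.import_from_postgresql(uri, table_name="cities")

# ...or the result of an arbitrary query
big_cities = rows.import_from_postgresql(
    uri, query="SELECT name, population FROM cities WHERE population > 1000000"
)

rows.export_to_postgresql(big_cities, uri, table_name="big_cities")
```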
Helper Functions
rows.utils.pgimport
: import data from CSV into PostgreSQL using the fastest possible method - requires thepsql
command available on your system (the command-line version of this function is pretty useful -- see more by runningrows pgimport --help
). The CSV can be optionally compressed (.csv
,.csv.gz
and.csv.xz
);rows.utils.pgexport
: export data from PostgreSQL into a CSV file using the fastest possible method - requires thepsql
command available on your system (the command-line version of this function is pretty useful -- see more by runningrows pgexport --help
). The CSV can be optionally compressed (.csv
,.csv.gz
and.csv.xz
).
## SQLite

Use `rows.import_from_sqlite` and `rows.export_to_sqlite` (no dependencies).
Helper functions:
- `rows.utils.sqlite2csv`: lazily convert SQLite tables into CSV files (the command-line version of this function is pretty useful -- see more by running `rows sqlite2csv --help`). The CSV can be optionally compressed (`.csv`, `.csv.gz` and `.csv.xz`).
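A minimal sketch (the file and table names are hypothetical, and the `table_name` parameter is assumed to work like in the PostgreSQL plugin):

```python
import rows

table = rows.import_from_dicts([{"name": "Curitiba", "population": 1963726}])

# Write to (and read back from) an SQLite file
rows.export_to_sqlite(table, "cities.sqlite", table_name="cities")
copy = rows.import_from_sqlite("cities.sqlite", table_name="cities")
```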
## TXT

Use `rows.import_from_txt` and `rows.export_to_txt` (no dependencies). You can customize the border style.
## XLS

Use `rows.import_from_xls` and `rows.export_to_xls` (dependencies must be installed with `pip install rows[xls]`). You can customize things like `sheet_name`, `sheet_index`, `start_row`, `end_row`, `start_column` and `end_column` (the last 5 options are indexes and start from 0).
On `rows.export_to_xls` you can define the `sheet_name`.
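A minimal sketch (the file name, sheet and boundaries are hypothetical):

```python
import rows

# Read only a slice of the first sheet (all indexes start from 0)
table = rows.import_from_xls(
    "budget.xls",
    sheet_index=0,
    start_row=2,     # skip two title rows above the header
    start_column=1,  # ignore the first column
)

rows.export_to_xls(table, "budget-clean.xls", sheet_name="clean")
```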
XLSX
use rows.import_from_xlsx
and rows.export_to_xlsx
(dependencies must be
installed with pip install rows[xlsx]
). You can customize things like
sheet_name
, sheet_index
, start_row
, end_row
, start_column
and
end_column
(the last 5 options are indexes and starts from 0).
On `rows.export_to_xlsx` you can define the `sheet_name`.
## XPath

Dependencies must be installed with `pip install rows[xpath]`. Very useful in Web scraping. Use `rows.import_from_xpath` passing the following arguments:
- `filename_or_fobj`: source XML/HTML;
- `rows_xpath`: XPath to find the elements which will be transformed into rows;
- `fields_xpath`: `collections.OrderedDict` containing XPaths for each of the fields (key: field name, value: XPath string) - you'll probably want to use `./` so it'll search inside the row found by `rows_xpath`.
Learn by example:
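A minimal sketch (the XML structure, file name and XPaths are hypothetical):

```python
from collections import OrderedDict

import rows

# Assuming a file shaped like:
# <people><person><name>Ada</name><age>36</age></person>...</people>
rows_xpath = "//person"
fields_xpath = OrderedDict([
    ("name", "./name/text()"),
    ("age", "./age/text()"),
])

table = rows.import_from_xpath("people.xml", rows_xpath, fields_xpath)
```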