llmsherpa.readers package¶

Submodules¶

llmsherpa.readers.file_reader module¶

class llmsherpa.readers.file_reader.LayoutPDFReader(parser_api_url)¶

Bases: object

Reads PDF content and recovers the hierarchical layout of the document's sections and structural components such as paragraphs, sentences, tables, lists, and sublists.

Parameters:: parser_api_url (str) – API url for LLM Sherpa. Use the custom url of your private instance here

read_pdf(path_or_url, contents=None)¶

Reads pdf from a url or path

Parameters:

path_or_url (str) – path or url to the pdf file e.g. https://someexapmple.com/myfile.pdf or /home/user/myfile.pdf
contents (bytes) – contents of the pdf file. If contents is given, path_or_url is ignored. This is useful when you already have the pdf file contents in memory such as if you are using streamlit or flask.
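A minimal usage sketch of the two call styles described above, assuming the `llmsherpa` package is installed and that `parser_api_url` points at a reachable parser endpoint. The helper names `load_pdf` and `load_pdf_bytes` are hypothetical and not part of the package; the import is done lazily inside each helper so that defining them needs neither the package nor a network connection.

```python
def load_pdf(parser_api_url, path_or_url):
    """Parse a PDF located at a local path or a URL."""
    # Lazy import: assumes the llmsherpa package is installed when called.
    from llmsherpa.readers import LayoutPDFReader

    reader = LayoutPDFReader(parser_api_url)
    return reader.read_pdf(path_or_url)


def load_pdf_bytes(parser_api_url, name, contents):
    """Parse a PDF that is already in memory, e.g. an uploaded file."""
    from llmsherpa.readers import LayoutPDFReader

    reader = LayoutPDFReader(parser_api_url)
    # The parameter description says path_or_url is ignored when contents
    # is given, so `name` is only a placeholder label here.
    return reader.read_pdf(name, contents=contents)
```

The second helper fits the streamlit or flask case mentioned above, where the uploaded file is available as bytes and never touches the disk.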

llmsherpa.readers.layout_reader module¶

class llmsherpa.readers.layout_reader.Block(block_json=None)¶

Bases: object

A block is a node in the layout tree. It can be a paragraph, a list item, a table, or a section header. This is the base class for all blocks such as Paragraph, ListItem, Table, Section.

tag¶

tag of the block e.g. para, list_item, table, header

Type:: str

level¶

level of the block in the layout tree

Type:: int

page_idx¶

page index of the block in the document. It starts from 0 and is -1 if the page number is not available

Type:: int

block_idx¶

id of the block as returned from the server. It starts from 0 and is -1 if the id is not available

Type:: int

top¶

top position of the block on the page; -1 if the position is not available (only available for tables)

Type:: float

left¶

left position of the block on the page; -1 if the position is not available (only available for tables)

Type:: float

bbox¶

bounding box of the block in the page and it is [] if the bounding box is not available

Type:: [float]

sentences¶

list of sentences in the block

Type:: list

children¶

list of immediate child blocks, but not the children of the children

Type:: list

parent¶

parent of the block

Type:: Block

block_json¶

json returned by the parser API for the block

Type:: dict

add_child(node)¶: Adds a child to the block. Sets the parent of the child to self.

chunks()¶: Returns all the chunks in the block. Chunking automatically splits the document into paragraphs, lists, and tables without any prior knowledge of the document structure.

iter_children(node, level, node_visitor)¶: Iterates over all the children of the node and calls the node_visitor function on each child.

paragraphs()¶: Returns all the paragraphs in the block. This is useful for getting all the paragraphs in a section.

parent_chain()¶: Returns the parent chain of the block consisting of all the parents of the block until the root.

parent_text()¶: Returns the text of the parent chain of the block. This is useful for adding section information to the text.

sections()¶: Returns all the sections in the block. This is useful for getting all the sections in a document.

tables()¶: Returns all the tables in the block. This is useful for getting all the tables in a section.

to_context_text(include_section_info=True)¶: Returns the text of the block with section information. This provides context to the text.

to_html(include_children=False, recurse=False)¶: Converts the block to html. This is a virtual method and should be implemented by the derived classes.

to_text(include_children=False, recurse=False)¶: Converts the block to text. This is a virtual method and should be implemented by the derived classes.
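To illustrate how add_child, parent_chain, parent_text and to_context_text relate to each other, here is a small self-contained sketch. `MiniBlock` is a hypothetical stand-in, not the llmsherpa Block class; the top-down ordering of the chain, the " > " separator and the newline between section info and text are all assumptions made for the example.

```python
class MiniBlock:
    """Hypothetical stand-in for a layout-tree node (not llmsherpa's Block)."""

    def __init__(self, tag, text, level=0):
        self.tag = tag
        self.text = text
        self.level = level
        self.children = []
        self.parent = None

    def add_child(self, node):
        # Mirrors the documented behaviour: append and set the child's parent.
        self.children.append(node)
        node.parent = self

    def parent_chain(self):
        # Walk upwards to the root, then reverse so the chain runs top-down
        # (the ordering is an assumption).
        chain = []
        node = self.parent
        while node is not None:
            chain.append(node)
            node = node.parent
        chain.reverse()
        return chain

    def parent_text(self):
        # Join the header titles along the chain; " > " is an assumed separator.
        return " > ".join(p.text for p in self.parent_chain() if p.tag == "header")

    def to_context_text(self, include_section_info=True):
        prefix = self.parent_text() if include_section_info else ""
        return f"{prefix}\n{self.text}" if prefix else self.text


root = MiniBlock("header", "User Guide")
section = MiniBlock("header", "Installation", level=1)
para = MiniBlock("para", "Install the package before running the examples.", level=2)
root.add_child(section)
section.add_child(para)

print(para.to_context_text())
```

Prefixing a chunk with its section path in this way is what gives an isolated paragraph enough context to be understood on its own.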

class llmsherpa.readers.layout_reader.Document(blocks_json)¶

Bases: object

A document is a tree of blocks. It is the root node of the layout tree.

chunks()¶: Returns all the chunks in the document. Chunking automatically splits the document into paragraphs, lists, and tables without any prior knowledge of the document structure.

sections()¶: Returns all the sections in the document. This is useful for getting all the sections in a document.

tables()¶: Returns all the tables in the document. This is useful for getting all the tables in a document.

to_html(include_duplicates=False)¶

Returns html for the document by iterating through all the sections

Parameters:

include_duplicates (bool) – If True, then html of all the sections is included. If False, then only the html of the top sections is included.

to_text(include_duplicates=False)¶

Returns text of a document by iterating through all the sections, separated by ‘\n’

Parameters:

include_duplicates (bool) – If True, then text of all the sections is included. If False, then only the text of the top sections is included.
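The Document methods can be combined into small helpers such as the ones sketched below. The function names are hypothetical; they rely only on the methods and attributes documented on this page (chunks, sections, tables, to_context_text, to_html, title, level) and accept any object that provides them.

```python
def chunk_texts(doc):
    """Return the context-enriched text of every chunk in a parsed document."""
    return [chunk.to_context_text() for chunk in doc.chunks()]


def section_titles(doc):
    """Return (level, title) pairs for every section in the document."""
    return [(section.level, section.title) for section in doc.sections()]


def tables_as_html(doc):
    """Return the HTML rendering of every table in the document."""
    return [table.to_html() for table in doc.tables()]
```

A typical use would be to pass the result of chunk_texts to an embedding or indexing step, and section_titles to build an outline.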

class llmsherpa.readers.layout_reader.LayoutReader¶

Bases: object

Reads the layout tree from the json returned by the parser API.

debug(pdf_root)¶

read(blocks_json)¶: Reads the layout tree from the json returned by the parser API. Constructs a tree of Block objects.
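As an illustration of how a flat list of blocks can be turned into a tree, the sketch below nests blocks by their level using a stack. This is a generic technique shown under assumptions, not the LayoutReader implementation: the `build_tree` name, the plain-dict nodes and the `text` key are hypothetical, and the real parser JSON may be shaped differently.

```python
def build_tree(blocks):
    """Nest a flat, ordered list of blocks using their 'level' values."""
    root = {"tag": "root", "level": -1, "children": []}
    stack = [root]  # stack[-1] is the most recently opened ancestor
    for block in blocks:
        node = {**block, "children": []}
        # Close every open node that is not shallower than the new one.
        while stack[-1]["level"] >= node["level"]:
            stack.pop()
        stack[-1]["children"].append(node)
        stack.append(node)
    return root


blocks = [
    {"tag": "header", "level": 0, "text": "Introduction"},
    {"tag": "para", "level": 1, "text": "First paragraph."},
    {"tag": "header", "level": 1, "text": "Background"},
    {"tag": "para", "level": 2, "text": "Nested paragraph."},
    {"tag": "header", "level": 0, "text": "Methods"},
]
tree = build_tree(blocks)
print([child["text"] for child in tree["children"]])
```

The root is given level -1 so that it is never popped as long as block levels start at 0.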

class llmsherpa.readers.layout_reader.ListItem(list_json)¶

Bases: Block

A list item is a block of text. It can have child list items. A list item has tag ‘list_item’.

to_html(include_children=False, recurse=False)¶

Converts the list item to html. If include_children is True, then the html of the children is also included. If recurse is True, then the html of the children’s children is also included.

Parameters:

include_children (bool) – If True, then the html of the children is also included
recurse (bool) – If True, then the html of the children’s children is also included

to_text(include_children=False, recurse=False)¶

Converts the list item to text. If include_children is True, then the text of the children is also included. If recurse is True, then the text of the children’s children is also included.

Parameters:

include_children (bool) – If True, then the text of the children is also included
recurse (bool) – If True, then the text of the children’s children is also included

class llmsherpa.readers.layout_reader.Paragraph(para_json)¶

Bases: Block

A paragraph is a block of text. It can have children such as lists. A paragraph has tag ‘para’.

to_html(include_children=False, recurse=False)¶

Converts the paragraph to html. If include_children is True, then the html of the children is also included. If recurse is True, then the html of the children’s children is also included.

Parameters:

include_children (bool) – If True, then the html of the children is also included
recurse (bool) – If True, then the html of the children’s children is also included

to_text(include_children=False, recurse=False)¶

Converts the paragraph to text. If include_children is True, then the text of the children is also included. If recurse is True, then the text of the children’s children is also included.

Parameters:

include_children (bool) – If True, then the text of the children is also included
recurse (bool) – If True, then the text of the children’s children is also included

class llmsherpa.readers.layout_reader.Section(section_json)¶

Bases: Block

A section is a block of text. It can have children such as paragraphs, lists, and tables. A section has tag ‘header’.

title¶

title of the section

Type:: str

to_html(include_children=False, recurse=False)¶

Converts the section to html. If include_children is True, then the html of the children is also included. If recurse is True, then the html of the children’s children is also included.

Parameters:

include_children (bool) – If True, then the html of the children is also included
recurse (bool) – If True, then the html of the children’s children is also included

to_text(include_children=False, recurse=False)¶

Converts the section to text. If include_children is True, then the text of the children is also included. If recurse is True, then the text of the children’s children is also included.

Parameters:

include_children (bool) – If True, then the text of the children is also included
recurse (bool) – If True, then the text of the children’s children is also included
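The interplay of include_children and recurse can be sketched with a small stand-in class. `Node` is hypothetical and not an llmsherpa class; joining the parts with a newline is an assumption, and only the flag semantics follow the parameter descriptions above.

```python
class Node:
    """Hypothetical stand-in used to illustrate the two flags."""

    def __init__(self, text, children=None):
        self.text = text
        self.children = children or []

    def to_text(self, include_children=False, recurse=False):
        parts = [self.text]
        if include_children:
            for child in self.children:
                # Direct children are always rendered here; deeper levels
                # are rendered only when recurse is set.
                parts.append(child.to_text(include_children=recurse, recurse=recurse))
        return "\n".join(parts)


section = Node(
    "Setup",
    [Node("Install the package.", [Node("Use a virtual environment.")])],
)

print(section.to_text())
print(section.to_text(include_children=True))
print(section.to_text(include_children=True, recurse=True))
```

In this sketch recurse has no effect unless include_children is also True, since the children are never visited otherwise.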

class llmsherpa.readers.layout_reader.Table(table_json, parent)¶

Bases: Block

A table is a block of text. It can have child table rows. A table has tag ‘table’.

to_html(include_children=False, recurse=False)¶: Returns html for a <table> with html from all the rows in the table as <tr>

to_text(include_children=False, recurse=False)¶: Returns text of a table with text from all the rows in the table delimited by ‘\n’

class llmsherpa.readers.layout_reader.TableCell(cell_json)¶

Bases: Block

A table cell is a block of text. It can have child paragraphs. A table cell has tag ‘table_cell’. A table cell is contained within table rows.

to_html()¶: Returns the cell value as html. If the cell value is a paragraph node, then the html of the node is returned.

to_text()¶: Returns the cell value as text. If the cell value is a paragraph node, then the text of the node is returned.

class llmsherpa.readers.layout_reader.TableHeader(row_json)¶

Bases: Block

A table header is a block of text. It can have child table cells.

to_html(include_children=False, recurse=False)¶: Returns html for a <th> with html from all the cells in the row as <td>

to_text(include_children=False, recurse=False)¶: Returns text of a row with text from all the cells in the row delimited by ‘|’; the header row is delimited by ‘---’. Text is returned in markdown format.

class llmsherpa.readers.layout_reader.TableRow(row_json)¶

Bases: Block

A table row is a block of text. It can have child table cells.

to_html(include_children=False, recurse=False)¶: Returns html for a <tr> with html from all the cells in the row as <td>

to_text(include_children=False, recurse=False)¶: Returns text of a row with text from all the cells in the row delimited by ‘|’
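The markdown-style text rendering described for tables, header rows and rows can be sketched as follows. The function names are hypothetical and the exact spacing around the ‘|’ delimiters is an assumption; the output produced by the library may differ in such details.

```python
def row_to_text(cells):
    """Join the cell values of one row with '|' delimiters."""
    return "| " + " | ".join(cells) + " |"


def header_to_text(cells):
    """Render a header row followed by a markdown '---' separator line."""
    separator = "| " + " | ".join("---" for _ in cells) + " |"
    return row_to_text(cells) + "\n" + separator


def table_to_text(header, rows):
    """Render a whole table, one row per line."""
    lines = [header_to_text(header)] + [row_to_text(row) for row in rows]
    return "\n".join(lines)


print(table_to_text(["Class", "Tag"], [["Paragraph", "para"], ["Section", "header"]]))
```

Keeping the header separator makes the text valid markdown, so the same string can be shown to a reader or passed to a model that understands markdown tables.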
