XML parser (libxml2)


C++ libxml2 parser

Greetings. My XML parser that converts from UTF-8 to CP1251 isn't quite working, or rather, it doesn't work at all.
The gist: I have code that reads data from XML and parses it, but the result is not in the encoding I need. Here it is:

Write a script that converts all your CP1251 junk to proper Unicode.
If that is impossible (for example, the crap keeps coming down the pipe in CP1251 and you don't control the source), write a stream filter that converts the encoding right as the file is read and either passes the result to the XML parser or collects it into a string (string or stringstream).
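The conversion step of such a filter might be sketched as follows. This is a minimal sketch in plain C rather than anything from the thread: the name `cp1251_to_utf8` and its buffer-based interface are assumptions made for illustration. Because every CP1251 byte maps to exactly one code point, the function keeps no state between calls and can be applied to each chunk as it is read.

```c
#include <stddef.h>
#include <stdint.h>

/* Unicode code points for CP1251 bytes 0x80..0xBF.
 * 0 marks the single unassigned byte (0x98). */
static const uint16_t cp1251_high[64] = {
    0x0402, 0x0403, 0x201A, 0x0453, 0x201E, 0x2026, 0x2020, 0x2021,
    0x20AC, 0x2030, 0x0409, 0x2039, 0x040A, 0x040C, 0x040B, 0x040F,
    0x0452, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x0000, 0x2122, 0x0459, 0x203A, 0x045A, 0x045C, 0x045B, 0x045F,
    0x00A0, 0x040E, 0x045E, 0x0408, 0x00A4, 0x0490, 0x00A6, 0x00A7,
    0x0401, 0x00A9, 0x0404, 0x00AB, 0x00AC, 0x00AD, 0x00AE, 0x0407,
    0x00B0, 0x00B1, 0x0406, 0x0456, 0x0491, 0x00B5, 0x00B6, 0x00B7,
    0x0451, 0x2116, 0x0454, 0x00BB, 0x0458, 0x0405, 0x0455, 0x0457
};

/* Hypothetical helper: converts n CP1251 bytes from `in` to UTF-8 in `out`.
 * Returns the number of bytes written, or (size_t)-1 if `cap` is too small.
 * The output is not NUL-terminated. */
size_t cp1251_to_utf8(const unsigned char *in, size_t n,
                      unsigned char *out, size_t cap)
{
    size_t o = 0;
    for (size_t i = 0; i < n; i++) {
        unsigned char b = in[i];
        uint32_t cp;

        if (b < 0x80) {
            cp = b;                          /* ASCII is unchanged */
        } else if (b >= 0xC0) {
            cp = 0x0410u + (uint32_t)(b - 0xC0); /* U+0410..U+044F, the basic Russian letters */
        } else {
            cp = cp1251_high[b - 0x80];
            if (cp == 0)
                cp = 0xFFFD;                 /* replacement character */
        }

        if (cp < 0x80) {
            if (o + 1 > cap) return (size_t)-1;
            out[o++] = (unsigned char)cp;
        } else if (cp < 0x800) {
            if (o + 2 > cap) return (size_t)-1;
            out[o++] = (unsigned char)(0xC0 | (cp >> 6));
            out[o++] = (unsigned char)(0x80 | (cp & 0x3F));
        } else {
            if (o + 3 > cap) return (size_t)-1;
            out[o++] = (unsigned char)(0xE0 | (cp >> 12));
            out[o++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (unsigned char)(0x80 | (cp & 0x3F));
        }
    }
    return o;
}
```

An output buffer three times the size of the input chunk is always large enough, since no CP1251 byte expands to more than three UTF-8 bytes.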

P.S. I hope your program works in Unicode (multibyte compilation mode)? If not, that is very bad; think about it right now and use UTF-8 everywhere.

It is already UTF-8 everywhere. For example, code like this displays correctly:

P.S. When copied from the IDE to the forum or, say, ICQ, it looks like this:

You have Russian text in the source file, and either the file is 100% not in a Unicode encoding or the forum isn't Unicode, my friend.
Convert the source file to UTF-8.
Check the locale in the Windows Control Panel. Set it to Russian.

When you copy from the source file, Visual Studio decides how to convert the copied fragment to Unicode. The decision is apparently based on the current locale setting (in the Control Panel's language options: the language for programs so old that they cannot handle Unicode). That is usually set to Russian or Ukrainian; yours is apparently something else, and by those "other" rules Windows guesses that the copied text was gibberish in a European single-byte code page.

Fixed. The parser itself had to be changed like this:
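The fix itself did not survive in this thread. Purely as a hedged illustration of the conversion the question asks about (UTF-8 strings, which is what libxml2 hands back, into CP1251), a sketch might look like this. The name `utf8_to_cp1251` is an assumption, and the function covers only ASCII, the basic Russian letters (U+0410..U+044F) and Ё/ё; anything else becomes `?`.

```c
#include <stddef.h>

/* Hypothetical helper: converts a NUL-terminated UTF-8 string to CP1251.
 * Writes at most cap bytes (including the terminating NUL) into `out`
 * and returns the number of bytes written, excluding the NUL. */
size_t utf8_to_cp1251(const char *in, char *out, size_t cap)
{
    const unsigned char *s = (const unsigned char *)in;
    unsigned char *dst = (unsigned char *)out;
    size_t o = 0;

    if (cap == 0)
        return 0;

    while (*s != 0 && o + 1 < cap) {
        unsigned long cp;
        int extra;
        int ok = 1;

        /* Decode the lead byte and find how many continuation bytes follow. */
        if (*s < 0x80)                { cp = *s;        extra = 0; }
        else if ((*s & 0xE0) == 0xC0) { cp = *s & 0x1F; extra = 1; }
        else if ((*s & 0xF0) == 0xE0) { cp = *s & 0x0F; extra = 2; }
        else if ((*s & 0xF8) == 0xF0) { cp = *s & 0x07; extra = 3; }
        else                          { cp = 0xFFFD;    extra = 0; }
        s++;

        for (; extra > 0; extra--) {
            if ((*s & 0xC0) != 0x80) { ok = 0; break; } /* truncated sequence */
            cp = (cp << 6) | (unsigned long)(*s++ & 0x3F);
        }
        if (!ok)
            cp = 0xFFFD;

        /* Map the code point to its CP1251 byte. */
        if (cp < 0x80)
            dst[o++] = (unsigned char)cp;
        else if (cp >= 0x0410 && cp <= 0x044F)
            dst[o++] = (unsigned char)(0xC0 + (cp - 0x0410));
        else if (cp == 0x0401)
            dst[o++] = 0xA8;                 /* Ё */
        else if (cp == 0x0451)
            dst[o++] = 0xB8;                 /* ё */
        else
            dst[o++] = '?';                  /* not covered by this sketch */
    }
    dst[o] = '\0';
    return o;
}
```

In a real program, iconv or libxml2's own encoding handlers would normally be a better choice than a hand-written table, since they cover the whole code page.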

libxml2-2.9.10

Introduction to libxml2

The libxml2 package contains libraries and utilities used for parsing XML files.

This package is known to build and work properly using an LFS-9.0 platform.

Package Information

Download MD5 sum: 10942a1dc23137a8aa07f0639cbfece5

Download size: 5.4 MB

Estimated disk space required: 87 MB (add 15 MB for tests)

Estimated build time: 0.2 SBU (add 0.3 SBU for tests)

Additional Downloads

Optional Testsuite: http://www.w3.org/XML/Test/xmlts20130923.tar.gz — This enables make check to do complete testing.

libxml2 Dependencies

Optional

Some packages which utilize libxml2 (such as GNOME Doc Utils ) need the Python3 module installed to function properly and some packages will not build properly if the Python3 module is not available.

The old Python2 module can be built after libxml2.so has been installed, see libxml2-2.9.10 (for Python2).

Optional

ICU-64.2 and Valgrind-3.15.0 (may be used in the tests)

Installation of libxml2

If you are going to run the tests, disable one test that prevents the tests from completing:

Install libxml2 by running the following commands:

If you downloaded the testsuite, issue the following command:

To test the results, issue: make check > check.log. A summary of the results can be obtained with grep -E '^Total|expected' check.log. If Valgrind-3.15.0 is installed and you want to check memory leaks, replace check with check-valgrind.

The tests use http://localhost/ to test parsing of external entities. If the machine where you run the tests serves as a web site, the tests may hang, depending on the content of the file served. It is therefore recommended to shut down the server during the tests, as the root user:

Now, as the root user:

Command Explanations

--disable-static : This switch prevents installation of static versions of the libraries.

--with-history : This switch enables Readline support when running xmlcatalog or xmllint in shell mode.

--with-python=/usr/bin/python3 : Allows building the libxml2 module with Python3 instead of Python2.

--with-icu : Add this switch if you have built ICU-64.2, for better Unicode support.

--with-threads : Add this switch to enable multithread support.

Contents

Short Descriptions

xml2-config

determines the compile and linker flags that should be used to compile and link programs that use libxml2 .

xmlcatalog

is used to monitor and manipulate XML and SGML catalogs.

xmllint

parses XML files and outputs reports (based upon options) to detect errors in XML coding.

libxml2.so

provides functions for programs to parse files that use the XML format.

libxml2mod.so

is the interface for Python3 to use libxml2.so.

HTML parser for C

I'm looking for an easy-to-use HTML parser library. I'm currently trying to set up libxml2, but I keep running into frustrating problems. The IDE I use is Pelles C. I took the Windows files for libxml2 and put them in the appropriate folders (headers in the include directory, binaries in bin, libs in lib, and so on), but when I try to compile a program, the compiler just tells me that every libxml2 function I call is undefined. For example:

-subsystem:console -machine:amd64 kernel32.lib advapi32.lib delayimp64.lib Ws2_32.lib libxml2.lib

just gives me the following errors when I try to compile:

This whole situation is driving me crazy. If anyone can help me solve this problem, or suggest an HTML parser that is easier to set up, I would greatly appreciate it.

c xml-parsing libxml2

3 answers

1 Accepted answer LSerni [2012-08-22 01:51:00]

These errors come from the linking stage: any library you used would give you the same problems.

Unless you installed the wrong package (for example, a 64-bit library instead of a 32-bit one, or vice versa).

For XML parsing, libxml2 is quite a useful tool; it is fairly fast and fairly powerful. Seeing that you have already started with it, I would try to solve the linker problems.

0 maoyang [2012-08-22 02:33:00]

I tried a tool called html2cxx that can parse HTML. It can parse HTML and CSS 1.0, although it has not been updated for several years.

0 Claudix [2012-08-22 01:52:00]

I once used Mini-XML. It compiles with ANSI C compilers. http://www.minixml.org/

However, you must be careful, because parsing HTML is not the same as parsing XML. For example, in HTML you can have tags without closing them. For example:

Interfaces, constants and types related to the XML parser

Macro: XML_COMPLETE_ATTRS

Bit in the loadsubset context field to tell to do complete the elements attributes lists with the ones defaulted from the DTDs. Use it to initialize xmlLoadExtDtdDefaultValue.


Macro: XML_DEFAULT_VERSION

The default version of XML used: 1.0

Macro: XML_DETECT_IDS

Bit in the loadsubset context field to tell to do ID/REFs lookups. Use it to initialize xmlLoadExtDtdDefaultValue.

Macro: XML_SAX2_MAGIC

Special constant found in SAX2 blocks initialized fields

Macro: XML_SKIP_IDS

Bit in the loadsubset context field to tell to not do ID/REFs registration. Used to initialize xmlLoadExtDtdDefaultValue in some special cases.

Enum xmlParserInputState

Structure xmlParserNodeInfo

Structure xmlParserNodeInfoSeq

Enum xmlParserOption

Structure xmlSAXHandlerV1

Function type: attributeDeclSAXFunc

An attribute definition has been parsed.

ctx: the user data (XML parser context)
elem: the name of the element
fullname: the attribute name
type: the attribute type
def: the type of default value
defaultValue: the attribute default value
tree: the tree of enumerated value set

Function type: attributeSAXFunc

Handle an attribute that has been read by the parser. The default handling is to convert the attribute into a DOM subtree and paste it into a new xmlAttr element added to the element.

ctx: the user data (XML parser context)
name: The attribute name, including namespace prefix
value: The attribute value

Function type: cdataBlockSAXFunc

Called when a CDATA block has been parsed.

ctx: the user data (XML parser context)
value: The CDATA content
len: the block length

Function type: charactersSAXFunc

Receiving some chars from the parser.

ctx: the user data (XML parser context)
ch: a xmlChar string
len: the number of xmlChar

Function type: commentSAXFunc

A comment has been parsed.

ctx: the user data (XML parser context)
value: the comment content

Function type: elementDeclSAXFunc

An element definition has been parsed.

ctx: the user data (XML parser context)
name: the element name
type: the element type
content: the element value tree

Function type: endDocumentSAXFunc

Called when the document end has been detected.

ctx: the user data (XML parser context)

Function type: endElementNsSAX2Func

SAX2 callback when an element end has been detected by the parser. It provides the namespace information for the element.

ctx: the user data (XML parser context)
localname: the local name of the element
prefix: the element namespace prefix if available
URI: the element namespace name if available

Function type: endElementSAXFunc

Called when the end of an element has been detected.

ctx: the user data (XML parser context)
name: The element name

Function type: entityDeclSAXFunc

An entity definition has been parsed.

ctx: the user data (XML parser context)
name: the entity name
type: the entity type
publicId: The public ID of the entity
systemId: The system ID of the entity
content: the entity value (without processing).

Function type: errorSAXFunc

Callback to display and format an error message.

ctx: an XML parser context
msg: the message to display/transmit
...: extra parameters for the message display

Function type: externalSubsetSAXFunc

Callback on external subset declaration.

ctx: the user data (XML parser context)
name: the root element name
ExternalID: the external ID
SystemID: the SYSTEM ID (e.g. filename or URL)

Function type: fatalErrorSAXFunc

Callback to display and format fatal error messages. Note: so far fatalError() SAX callbacks are not used; error() gets all the callbacks for errors.

ctx: an XML parser context
msg: the message to display/transmit
...: extra parameters for the message display

Function type: getEntitySAXFunc

Get an entity by name.

ctx: the user data (XML parser context)
name: The entity name
Returns: the xmlEntityPtr if found.

Function type: getParameterEntitySAXFunc

Get a parameter entity by name.

ctx: the user data (XML parser context)
name: The entity name
Returns: the xmlEntityPtr if found.

Function type: hasExternalSubsetSAXFunc

Does this document have an external subset?

ctx: the user data (XML parser context)
Returns: 1 if true

Function type: hasInternalSubsetSAXFunc

Does this document have an internal subset?

ctx: the user data (XML parser context)
Returns: 1 if true

Function type: ignorableWhitespaceSAXFunc

Receiving some ignorable whitespaces from the parser. UNUSED: by default the DOM building will use characters.

ctx: the user data (XML parser context)
ch: a xmlChar string
len: the number of xmlChar

Function type: internalSubsetSAXFunc

Callback on internal subset declaration.

ctx: the user data (XML parser context)
name: the root element name
ExternalID: the external ID
SystemID: the SYSTEM ID (e.g. filename or URL)


Function type: isStandaloneSAXFunc

Is this document tagged standalone?

ctx: the user data (XML parser context)
Returns: 1 if true

Function type: notationDeclSAXFunc

What to do when a notation declaration has been parsed.

ctx: the user data (XML parser context)
name: The name of the notation
publicId: The public ID of the entity
systemId: The system ID of the entity

Function type: processingInstructionSAXFunc

A processing instruction has been parsed.

ctx: the user data (XML parser context)
target: the target name
data: the PI data

Function type: referenceSAXFunc

Called when an entity reference is detected.

ctx: the user data (XML parser context)
name: The entity name

Function type: resolveEntitySAXFunc

Callback: the entity loader. To control the loading of external entities, the application can either override this resolveEntity() callback in the SAX block or, better, use the xmlSetExternalEntityLoader() function to set up its own entity resolution routine.

ctx: the user data (XML parser context)
publicId: The public ID of the entity
systemId: The system ID of the entity
Returns: the xmlParserInputPtr if inlined or NULL for DOM behaviour.

Function type: setDocumentLocatorSAXFunc

Receive the document locator at startup, actually xmlDefaultSAXLocator. Everything is available on the context, so this is useless in our case.

ctx: the user data (XML parser context)
loc: A SAX Locator

Function type: startDocumentSAXFunc

Called when the document starts being processed.

ctx: the user data (XML parser context)

Function type: startElementNsSAX2Func

SAX2 callback when an element start has been detected by the parser. It provides the namespace information for the element, as well as the new namespace declarations on the element.

ctx: the user data (XML parser context)
localname: the local name of the element
prefix: the element namespace prefix if available
URI: the element namespace name if available
nb_namespaces: number of namespace definitions on that node
namespaces: pointer to the array of prefix/URI pairs namespace definitions
nb_attributes: the number of attributes on that node
nb_defaulted: the number of defaulted attributes. The defaulted ones are at the end of the array
attributes: pointer to the array of (localname/prefix/URI/value/end) attribute values.

Function type: startElementSAXFunc

Called when an opening tag has been processed.

ctx: the user data (XML parser context)
name: The element name, including namespace prefix
atts: An array of name/value attribute pairs, NULL terminated

Function type: unparsedEntityDeclSAXFunc

What to do when an unparsed entity declaration is parsed.

ctx: the user data (XML parser context)
name: The name of the entity
publicId: The public ID of the entity
systemId: The system ID of the entity
notationName: the name of the notation

Function type: warningSAXFunc

Callback to display and format a warning message.

ctx: an XML parser context
msg: the message to display/transmit
...: extra parameters for the message display

Function: xmlByteConsumed

This function provides the current index of the parser relative to the start of the current entity. The index is computed in bytes, starting at zero at the beginning and finishing at the size of the file in bytes if parsing a file. The function has constant cost if the input is UTF-8 but can be costly if run on non-UTF-8 input.

ctxt: an XML parser context
Returns: the index in bytes from the beginning of the entity or -1 in case the index could not be computed.

Function: xmlCleanupParser

Cleanup function for the XML library. It tries to reclaim all parsing related global memory allocated for the library processing. It doesn’t deallocate any document related memory. Calling this function should not prevent reusing the library but one should call xmlCleanupParser() only when the process has finished using the library or XML document built with it.

Function: xmlClearNodeInfoSeq

Clear (release memory and reinitialize) node info sequence

seq: a node info sequence pointer

Function: xmlClearParserCtxt

Clear (release owned resources) and reinitialize a parser context

ctxt: an XML parser context

Function: xmlCreateDocParserCtxt

Creates a parser context for an XML in-memory document.

cur: a pointer to an array of xmlChar
Returns: the new parser context or NULL

Function: xmlCreateIOParserCtxt

Create a parser context for using the XML parser with an existing I/O stream

sax: a SAX handler
user_data: The user data returned on SAX callbacks
ioread: an I/O read function
ioclose: an I/O close function
ioctx: an I/O handler
enc: the charset encoding if known
Returns: the new parser context or NULL

Function: xmlCreatePushParserCtxt

Create a parser context for using the XML parser in push mode. If @chunk is non-NULL and @size is nonzero, the data is used to detect the encoding. The remaining characters will be parsed, so they don't need to be fed in again through xmlParseChunk. To allow content encoding detection, @size should be >= 4. The value of @filename is used for fetching external entities and for error/warning reports.

sax: a SAX handler
user_data: The user data returned on SAX callbacks
chunk: a pointer to an array of chars
size: number of chars in the array
filename: an optional file name or URI
Returns: the new parser context or NULL

Function: xmlCtxtReadDoc

parse an XML in-memory document and build a tree. This reuses the existing @ctxt parser context

ctxt: an XML parser context
cur: a pointer to a zero terminated string
URL: the base URL to use for the document
encoding: the document encoding, or NULL
options: a combination of xmlParserOption
Returns: the resulting document tree

Function: xmlCtxtReadFd

parse an XML from a file descriptor and build a tree. This reuses the existing @ctxt parser context. NOTE that the file descriptor will not be closed when the reader is closed or reset.

ctxt: an XML parser context
fd: an open file descriptor
URL: the base URL to use for the document
encoding: the document encoding, or NULL
options: a combination of xmlParserOption
Returns: the resulting document tree

Function: xmlCtxtReadFile

parse an XML file from the filesystem or the network. This reuses the existing @ctxt parser context

ctxt: an XML parser context
filename: a file or URL
encoding: the document encoding, or NULL
options: a combination of xmlParserOption
Returns: the resulting document tree

Function: xmlCtxtReadIO

parse an XML document from I/O functions and source and build a tree. This reuses the existing @ctxt parser context

ctxt: an XML parser context
ioread: an I/O read function
ioclose: an I/O close function
ioctx: an I/O handler
URL: the base URL to use for the document
encoding: the document encoding, or NULL
options: a combination of xmlParserOption
Returns: the resulting document tree

Function: xmlCtxtReadMemory

parse an XML in-memory document and build a tree. This reuses the existing @ctxt parser context

ctxt: an XML parser context
buffer: a pointer to a char array
size: the size of the array
URL: the base URL to use for the document
encoding: the document encoding, or NULL
options: a combination of xmlParserOption
Returns: the resulting document tree

Function: xmlCtxtReset

Reset a parser context

ctxt: an XML parser context


Function: xmlCtxtResetPush

Reset a push parser context

ctxt: an XML parser context
chunk: a pointer to an array of chars
size: number of chars in the array
filename: an optional file name or URI
encoding: the document encoding, or NULL
Returns: 0 in case of success and 1 in case of error

Function: xmlCtxtUseOptions

Applies the options to the parser context

ctxt: an XML parser context
options: a combination of xmlParserOption
Returns: 0 in case of success, the set of unknown or unimplemented options in case of error.

Function type: xmlExternalEntityLoader

External entity loaders types.

URL: The System ID of the resource requested
ID: The Public ID of the resource requested
context: the XML parser context
Returns: the entity input parser.

Function: xmlFreeParserCtxt

Free all the memory used by a parser context. However the parsed document in ctxt->myDoc is not freed.

ctxt: an XML parser context

Function: xmlGetExternalEntityLoader

Get the default external entity resolver function for the application

Returns: the xmlExternalEntityLoader function pointer

Function: xmlGetFeature

Read the current value of one feature of this parser instance

ctxt: an XML/HTML parser context
name: the feature name
result: location to store the result
Returns: -1 in case of error, 0 otherwise

Function: xmlGetFeaturesList

Copy at most *@len feature names into the @result array

len: the length of the features name array (input/output)
result: an array of string to be filled with the features name.
Returns: -1 in case of error, or the total number of features. len is updated with the number of strings copied; the strings must not be deallocated

Function: xmlIOParseDTD

Load and parse a DTD

sax: the SAX handler block or NULL
input: an Input Buffer
enc: the charset encoding if known
Returns: the resulting xmlDtdPtr or NULL in case of error. @input will be freed at parsing end.

Function: xmlInitNodeInfoSeq

Initialize (set to initial state) node info sequence

seq: a node info sequence pointer

Function: xmlInitParser

Initialization function for the XML parser. This is not reentrant. Call once before processing in case of use in multithreaded programs.

Function: xmlInitParserCtxt

Initialize a parser context

ctxt: an XML parser context
Returns: 0 in case of success and -1 in case of error

Function: xmlKeepBlanksDefault

Set and return the previous value for default blank text node support. The 1.x version of the parser used a heuristic to try to detect ignorable whitespace. As a result, the SAX callback was generating xmlSAX2IgnorableWhitespace() callbacks instead of characters() ones, and when using the DOM output, text nodes containing those blanks were not generated. The 2.x and later versions switch to the XML standard way, and ignorableWhitespace() callbacks are only generated when running the parser in validating mode and when the current element doesn't allow CDATA or mixed content. This function is provided as a way to force the standard behavior on 1.x libs and to switch back to the old mode for compatibility when running 1.x client code on 2.x. Upgrading 1.x code should be done by using the xmlIsBlankNode() convenience function to detect the "empty" nodes generated. This value also affects autogeneration of indentation when saving code: if blank sections are kept, indentation is not generated.

val: int 0 or 1
Returns: the previous value (0 or 1).

Function: xmlLineNumbersDefault

Set and return the previous value for enabling line numbers in element contents. This may break old applications and is turned off by default.

val: int 0 or 1
Returns: the previous value (0 or 1).

Function: xmlLoadExternalEntity

Load an external entity. Note that the use of this function for unparsed entities may generate problems. TODO: a more generic external entity API must be designed.

URL: the URL for the entity to load
ID: the Public ID for the entity to load
ctxt: the context in which the entity is called or NULL
Returns: the xmlParserInputPtr or NULL

Function: xmlNewIOInputStream

Create a new input stream structure encapsulating the @input into a stream suitable for the parser.

ctxt: an XML parser context
input: an I/O Input
enc: the charset encoding if known
Returns: the new input stream or NULL

Function: xmlNewParserCtxt

Allocate and initialize a new parser context.

Returns: the xmlParserCtxtPtr or NULL

Function: xmlParseBalancedChunkMemory

Parse a well-balanced chunk of an XML document called by the parser. The allowed sequence for the well-balanced chunk is the one defined by the content production in the XML grammar: [43] content ::= (element | CharData | Reference | CDSect | PI | Comment)*

doc: the document the chunk pertains to
sax: the SAX handler block (possibly NULL)
user_data: The user data returned on SAX callbacks (possibly NULL)
depth: Used for loop detection, use 0
string: the input string in UTF8 or ISO-Latin (zero terminated)
lst: the return value for the set of parsed nodes
Returns: 0 if the chunk is well balanced, -1 in case of args problem and the parser error code otherwise

Function: xmlParseBalancedChunkMemoryRecover

Parse a well-balanced chunk of an XML document called by the parser. The allowed sequence for the well-balanced chunk is the one defined by the content production in the XML grammar: [43] content ::= (element | CharData | Reference | CDSect | PI | Comment)*

doc: the document the chunk pertains to
sax: the SAX handler block (possibly NULL)
user_data: The user data returned on SAX callbacks (possibly NULL)
depth: Used for loop detection, use 0
string: the input string in UTF8 or ISO-Latin (zero terminated)
lst: the return value for the set of parsed nodes
recover: return nodes even if the data is broken (use 0)
Returns: 0 if the chunk is well balanced, -1 in case of args problem and the parser error code otherwise. In case recover is set to 1, the nodelist will not be empty even if the parsed chunk is not well balanced.

Function: xmlParseChunk

Parse a Chunk of memory

ctxt: an XML parser context
chunk: a char array
size: the size in bytes of the chunk
terminate: last chunk indicator
Returns: zero if no error, the xmlParserErrors otherwise.

Function: xmlParseCtxtExternalEntity

Parse an external general entity within an existing parsing context. An external general parsed entity is well-formed if it matches the production labeled extParsedEnt. [78] extParsedEnt ::= TextDecl? content

ctx: the existing parsing context
URL: the URL for the entity to load
ID: the System ID for the entity to load
lst: the return value for the set of parsed nodes
Returns: 0 if the entity is well formed, -1 in case of args problem and the parser error code otherwise

Function: xmlParseDTD

Load and parse an external subset.

ExternalID: a NAME* containing the External ID of the DTD
SystemID: a NAME* containing the URL to the DTD
Returns: the resulting xmlDtdPtr or NULL in case of error.

Function: xmlParseDoc

parse an XML in-memory document and build a tree.

cur: a pointer to an array of xmlChar
Returns: the resulting document tree

Function: xmlParseDocument

parse an XML document (and build a tree if using the standard SAX interface). [1] document ::= prolog element Misc* [22] prolog ::= XMLDecl? Misc* (doctypedecl Misc*)?

ctxt: an XML parser context
Returns: 0 on success, -1 in case of error. The parser context is augmented as a result of the parsing.

Function: xmlParseEntity

parse an XML external entity out of context and build a tree. [78] extParsedEnt ::= TextDecl? content. This corresponds to a "well-balanced" chunk.

filename: the filename
Returns: the resulting document tree

Function: xmlParseExtParsedEnt

parse a general parsed entity. An external general parsed entity is well-formed if it matches the production labeled extParsedEnt. [78] extParsedEnt ::= TextDecl? content

ctxt: an XML parser context
Returns: 0 on success, -1 in case of error. The parser context is augmented as a result of the parsing.

Function: xmlParseExternalEntity

Parse an external general entity. An external general parsed entity is well-formed if it matches the production labeled extParsedEnt. [78] extParsedEnt ::= TextDecl? content

doc: the document the chunk pertains to
sax: the SAX handler block (possibly NULL)
user_data: The user data returned on SAX callbacks (possibly NULL)
depth: Used for loop detection, use 0
URL: the URL for the entity to load
ID: the System ID for the entity to load
lst: the return value for the set of parsed nodes
Returns: 0 if the entity is well formed, -1 in case of args problem and the parser error code otherwise

Function: xmlParseFile

parse an XML file and build a tree. Automatic support for ZLIB/Compress compressed documents is provided by default if found at compile-time.

filename: the filename
Returns: the resulting document tree if the file was wellformed, NULL otherwise.

Function: xmlParseMemory

parse an XML in-memory block and build a tree.

buffer: a pointer to a char array
size: the size of the array
Returns: the resulting document tree

Function: xmlParserAddNodeInfo

Insert node info record into the sorted sequence

ctxt: an XML parser context
info: a node info sequence pointer

Function: xmlParserFindNodeInfo

Find the parser node info struct for a given node

ctx: an XML parser context
node: an XML node within the tree
Returns: an xmlParserNodeInfo block pointer or NULL

Function: xmlParserFindNodeInfoIndex

Find the index that the info record for the given node is or should be at in a sorted sequence

seq: a node info sequence pointer
node: an XML node pointer
Returns: a long indicating the position of the record

Function type: xmlParserInputDeallocate

Callback for freeing some parser input allocations.

str: the string to deallocate

Function: xmlParserInputGrow

This function increases the input for the parser. It tries to preserve pointers to the input buffer, and keeps already read data

in: an XML parser input
len: an indicative size for the lookahead
Returns: the number of xmlChars read, or -1 in case of error; 0 indicates the end of this entity

Function: xmlParserInputRead

This function refreshes the input for the parser. It doesn't try to preserve pointers to the input buffer, and discards already read data

in: an XML parser input
len: an indicative size for the lookahead
Returns: the number of xmlChars read, or -1 in case of error; 0 indicates the end of this entity

Function: xmlPedanticParserDefault

Set and return the previous value for enabling pedantic warnings.

val: int 0 or 1
Returns: the previous value (0 or 1).

Function: xmlReadDoc

parse an XML in-memory document and build a tree.

cur: a pointer to a zero terminated string
URL: the base URL to use for the document
encoding: the document encoding, or NULL
options: a combination of xmlParserOption
Returns: the resulting document tree

Function: xmlReadFd

parse an XML from a file descriptor and build a tree. NOTE that the file descriptor will not be closed when the reader is closed or reset.

fd: an open file descriptor
URL: the base URL to use for the document
encoding: the document encoding, or NULL
options: a combination of xmlParserOption
Returns: the resulting document tree

Function: xmlReadFile

parse an XML file from the filesystem or the network.

filename: a file or URL
encoding: the document encoding, or NULL
options: a combination of xmlParserOption
Returns: the resulting document tree

Function: xmlReadIO

parse an XML document from I/O functions and source and build a tree.

ioread: an I/O read function
ioclose: an I/O close function
ioctx: an I/O handler
URL: the base URL to use for the document
encoding: the document encoding, or NULL
options: a combination of xmlParserOption
Returns: the resulting document tree

Function: xmlReadMemory

parse an XML in-memory document and build a tree.

buffer: a pointer to a char array
size: the size of the array
URL: the base URL to use for the document
encoding: the document encoding, or NULL
options: a combination of xmlParserOption
Returns: the resulting document tree

Function: xmlRecoverDoc

parse an XML in-memory document and build a tree. In the case the document is not Well Formed, a tree is built anyway

cur: a pointer to an array of xmlChar
Returns: the resulting document tree

Function: xmlRecoverFile

parse an XML file and build a tree. Automatic support for ZLIB/Compress compressed documents is provided by default if found at compile-time. If the document is not well-formed, a tree is built anyway.

filename: the filename
Returns: the resulting document tree

Function: xmlRecoverMemory

parse an XML in-memory block and build a tree. In the case the document is not Well Formed, a tree is built anyway

buffer: a pointer to a char array
size: the size of the array
Returns: the resulting document tree

Function: xmlSAXParseDTD

Load and parse an external subset.

sax: the SAX handler block
ExternalID: a NAME* containing the External ID of the DTD
SystemID: a NAME* containing the URL to the DTD
Returns: the resulting xmlDtdPtr or NULL in case of error.

Function: xmlSAXParseDoc

parse an XML in-memory document and build a tree. It uses the given SAX function block to handle the parsing callbacks. If sax is NULL, it falls back to the default DOM tree building routines.

sax: the SAX handler block
cur: a pointer to an array of xmlChar
recovery: work in recovery mode, i.e. tries to read documents that are not well-formed
Returns: the resulting document tree

Function: xmlSAXParseEntity

parse an XML external entity out of context and build a tree. It uses the given SAX function block to handle the parsing callbacks. If sax is NULL, it falls back to the default DOM tree building routines. The entity must match production [78] extParsedEnt ::= TextDecl? content, which corresponds to a "Well Balanced" chunk.

sax: the SAX handler block
filename: the filename
Returns: the resulting document tree

Function: xmlSAXParseFile

parse an XML file and build a tree. Automatic support for ZLIB/Compress compressed documents is provided by default if found at compile-time. It uses the given SAX function block to handle the parsing callbacks. If sax is NULL, it falls back to the default DOM tree building routines.

sax: the SAX handler block
filename: the filename
recovery: work in recovery mode, i.e. try to read documents that are not well formed
Returns: the resulting document tree

Function: xmlSAXParseFileWithData

parse an XML file and build a tree. Automatic support for ZLIB/Compress compressed documents is provided by default if found at compile-time. It uses the given SAX function block to handle the parsing callbacks. If sax is NULL, it falls back to the default DOM tree building routines. User data (void *) is stored within the parser context in the context's _private member, so it is available nearly everywhere in libxml.

sax: the SAX handler block
filename: the filename
recovery: work in recovery mode, i.e. try to read documents that are not well formed
data: the userdata
Returns: the resulting document tree

Function: xmlSAXParseMemory

parse an XML in-memory block and use the given SAX function block to handle the parsing callbacks. If sax is NULL, it falls back to the default DOM tree building routines.

sax: the SAX handler block
buffer: a pointer to a char array
size: the size of the array
recovery: work in recovery mode, i.e. try to read documents that are not well formed
Returns: the resulting document tree

Function: xmlSAXParseMemoryWithData

parse an XML in-memory block and use the given SAX function block to handle the parsing callbacks. If sax is NULL, it falls back to the default DOM tree building routines. User data (void *) is stored within the parser context in the context's _private member, so it is available nearly everywhere in libxml.

sax: the SAX handler block
buffer: a pointer to a char array
size: the size of the array
recovery: work in recovery mode, i.e. try to read documents that are not well formed
data: the userdata
Returns: the resulting document tree

Function: xmlSAXUserParseFile

parse an XML file and call the given SAX handler routines. Automatic support for ZLIB/Compress compressed document is provided

sax: a SAX handler
user_data: The user data returned on SAX callbacks
filename: a file name
Returns: 0 in case of success or an error number otherwise

Function: xmlSAXUserParseMemory

A better SAX parsing routine. parse an XML in-memory buffer and call the given SAX handler routines.

sax: a SAX handler
user_data: The user data returned on SAX callbacks
buffer: an in-memory XML document input
size: the length of the XML document in bytes
Returns: 0 in case of success or an error number otherwise

Function: xmlSetExternalEntityLoader

Changes the default external entity resolver function for the application

f: the new entity resolver function

Function: xmlSetFeature

Change the current value of one feature of this parser instance

ctxt: an XML/HTML parser context
name: the feature name
value: pointer to the location of the new value
Returns: -1 in case of error, 0 otherwise

Function: xmlSetupParserForBuffer

Setup the parser context to parse a new buffer; Clears any prior contents from the parser context. The buffer parameter must not be NULL, but the filename parameter can be

ctxt: an XML parser context
buffer: a xmlChar * buffer
filename: a file name

Function: xmlStopParser

Blocks further parser processing

ctxt: an XML parser context

Function: xmlSubstituteEntitiesDefault

Set and return the previous value for default entity support. Initially the parser always keeps entity references instead of substituting entity values in the output. This function has to be used to change the default parser behavior. SAX::substituteEntities() has to be used for changing that on a file by file basis.

libxml++ — An XML Parser for C++

Murray Cumming

This is an introduction to libxml’s C++ binding, with simple examples.

Table of Contents

libxml++

libxml++ is a C++ API for the popular libxml XML parser, written in C. libxml is famous for its high performance and compliance to standard specifications, but its C API is quite difficult even for common tasks.

libxml++ presents a simple C++-like API that can achieve common tasks with less code. Unlike some other C++ parsers, it does not try to avoid the advantages of standard C++ features such as namespaces, STL containers or runtime type identification, and it does not try to conform to standard API specifications meant for Java. Therefore libxml++ requires a fairly modern C++ compiler such as g++ 3.

But libxml++ was created mainly to fill the need for an API-stable and ABI-stable C++ XML parser which could be used as a shared library dependency by C++ applications that are distributed widely in binary form. That means that installed applications will not break when new versions of libxml++ are installed on a user’s computer. Gradual improvement of the libxml++ API is still possible via non-breaking API additions, and new independent versions of the ABI that can be installed in parallel with older versions. These are the general techniques and principles followed by the GNOME project, of which libxml++ is a part.

Installation

libxml++ is packaged by major Linux and *BSD distributions and can be installed from source on Linux and Windows, using any modern compiler, such as g++, SUN Forte, or MSVC++.

For instance, to install libxml++ and its documentation on Debian, use apt-get or synaptic like so:

To check that you have the libxml++ development packages installed, and that your environment is working properly, try pkg-config libxml++-2.6 --modversion .

The source code may be downloaded from libxmlplusplus.sourceforge.net . libxml++ is licensed under the LGPL, which allows its use via dynamic linking in both open source and closed-source software. The underlying libxml library uses the even more generous MIT licence.

UTF-8 and Glib::ustring

The libxml++ API takes, and gives, strings in the UTF-8 Unicode encoding, which can support all known languages and locales. This choice was made because, of the encodings that have this capability, UTF-8 is the most commonly accepted choice. UTF-8 is a multi-byte encoding, meaning that some characters use more than 1 byte. But for compatibility, old-fashioned 7-bit ASCII strings are unchanged when encoded as UTF-8, and UTF-8 strings do not contain null bytes which would cause old code to misjudge the number of bytes. For these reasons, you can store a UTF-8 string in a std::string object. However, the std::string API will operate on that string in terms of bytes, instead of characters.

Because Standard C++ has no string class that can fully handle UTF-8, libxml++ uses the Glib::ustring class from the glibmm library. Glib::ustring has almost exactly the same API as std::string, but methods such as length() and operator[] deal with whole UTF-8 characters rather than raw bytes.

There are implicit conversions between std::string and Glib::ustring, so you can use std::string wherever you see a Glib::ustring in the API, if you really don’t care about any locale other than English. However, that is unlikely in today’s connected world.

glibmm also provides useful API to convert between encodings and locales.

Compilation and Linking

To use libxml++ in your application, you must tell the compiler where to find the include headers and where to find the libxml++ library. libxml++ provides a pkg-config .pc file to make this easy. For instance, the following command will provide the necessary compiler options: pkg-config libxml++-2.6 --cflags --libs

Libxml2: Everything You Need in an XML Library

Libxml2 is the XML parser and toolkit written in the C language and is freely available for integration into your apps via the easy-to-digest MIT License. Libxml2 was originally developed for the Gnome project, but doesn't have any dependencies on it or even the Linux platform. This tool is known to be highly portable and is in use by many teams on Linux, Unix, Win32/Win64, Cygwin, MacOS, MacOS/X, and most other platforms, including embedded systems. Even though Libxml2 was written in C, there is an abundance of language bindings available, including bindings for Python, Perl, C++, C#, PHP, Pascal, Ruby, and Tcl.

As you know, XML itself is a metalanguage used to design markup languages. That is to say, it is a grammar where semantics and structure are added to the content using extra "markup" information enclosed between angle brackets "< >". HTML certainly is the most well-known markup language and the specification of HTML 4.0 can be fully articulated using an XML Document Type Definition (DTD).

Of course, just saying something is an XML parser doesn’t imply all that much. You have to enumerate both how much and what you’re going to support. As such, Libxml2 implements a number of existing standards related to markup languages. I won’t bore you with the whole laundry list, but the majors are: XML standard 1.0 including Namespaces, Base, URI, XPointer, XInclude, XPath, HTML 4.0 parser, Canonical XML 1.0, XML Schemas Part 2, xml:id, and XML Catalog working drafts. In most cases, libxml2 tries to implement the specifications in a relatively strictly compliant way. Libxml2 has passed all 1800+ tests from the OASIS XML Testsuite.

XML documents aren’t always sitting around on your local filesystem for perusal, so Libxml2 includes basic FTP and HTTP clients so you don’t have to write an extra layer of code just to find your documents. Libxml2 exports Push (progressive) and Pull (blocking) type parser interfaces for both XML and HTML. Libxml2 can do DTD validation at parse time, using a parsed document instance, or with an arbitrary DTD. Sister projects provide some additional goodies like XSLT 1.0 (from libxslt) and a DOM2 implementation is also in the works.

Let’s Get This Parser Started!

Although you are certainly welcome to recompile the source to meet your own project requirement quirks, I found the simplest way to get parsing was through Igor Zlatkovic’s dedicated libxml Win32 resource page. In addition to DLL downloads, you will also find C#, Perl (Apache), and Pascal language bindings at the bottom of Zlatkovic’s page. Zlatkovic has packaged Libxml2 and related tools so you can simply take the subset you really need:

  • libxml2, the XML parser and processor
  • libxslt, the XSL and EXSL Transformations processor
  • xmlsec, the XMLSec and XMLDSig processor
  • xsldbg, the XSL Transformations debugger
  • openssl, the general crypto toolkit
  • iconv, the character encoding toolkit
  • zlib, the compression toolkit

Figure 1: libxml package dependencies

For example, libxml depends on iconv and zlib. If you run the included xmllint.exe or xmlcatalog.exe, you will quickly discover that you need iconv.dll (as promised in the dependency chart). Be advised that Zlatkovic's downloads don't include the sample programs and data files, however.

For purposes of this article, I used Zlatkovic's distribution of libxml2 2.6.30+, iconv 1.9.2, and zlib 1.2.3.

How to Parse a Tree with Libxml2

You'll only look at the Document Object Model (DOM) parser because that is inherently more complex than the Simple API for XML (SAX). As you recall, the DOM model gives you complete tree navigation at the cost of maintaining the whole XML file in memory. If you're not modifying the tree as it's being parsed, SAX can have significantly less overhead.

Figure 2: DOM tree example

Specifically, you’ll dissect the tree1.c example program and identify some common programming paradigms used. The purpose of this program is to parse a file to a tree, use xmlDocGetRootElement() to get the root element, and then walk the document and print all the element names in document order. This is about the easiest non-trivial sort of thing you can do in XML. For simplicity’s sake, you’ll assume that the XML file you want to parse is the first argument on the command line and output will go to stdout (console). Program listing follows:

To run the program, you assume that libxml.dll, iconv.dll, and zlib.dll are all locatable in the path or current directory. To compile your test program, you use the following command line:


You feed it the test data file as shown below as input:

Which yields the following as output:

The program starts on Line #17 in the main() function. The LIBXML_TEST_VERSION is a safety check to make sure the libxml.dll you are using is in fact compatible with the version you are compiled for (in other words, the headers you used).

The actual parsing takes place inside xmlReadFile() on Line #27, which returns an xmlDoc object if successful. The first parameter is the local filename or an HTTP document path (URL). The second parameter is the document encoding, which may be NULL to let the parser determine it. The last parameter is a bitwise OR of option flags, which are one or more of the following:

XML_PARSE_RECOVER Recover on errors
XML_PARSE_NOENT Substitute entities
XML_PARSE_DTDLOAD Load the external subset
XML_PARSE_DTDATTR Default DTD attributes
XML_PARSE_DTDVALID Validate with the DTD
XML_PARSE_NOERROR Suppress error reports
XML_PARSE_NOWARNING Suppress warning reports
XML_PARSE_PEDANTIC Pedantic error reporting
XML_PARSE_NOBLANKS Remove blank nodes
XML_PARSE_SAX1 Use the SAX1 interface internally
XML_PARSE_XINCLUDE Implement xinclude substitution
XML_PARSE_NONET Forbid network access
XML_PARSE_NODICT Do not reuse the context dictionary
XML_PARSE_NSCLEAN Remove redundant namespaces declarations
XML_PARSE_NOCDATA Merge CDATA as text nodes
XML_PARSE_NOXINCNODE Do not generate XINCLUDE START/END nodes
XML_PARSE_COMPACT Compact small text nodes; no modification of the tree allowed afterwards (will possibly crash if you try to modify the tree)

Most notably, XML_PARSE_COMPACT can make up for some of the memory performance hits that DOM parsers are known for if you need the XML tree only for read-only purposes. Note also the ability to turn on DTD validation at this point as well.

Next, in Line #33, you call xmlDocGetRootElement(doc) which, as you would expect, gives you the top of the tree that you then can traverse easily using the recursive print_element_names() function. Inside print_element_names(), Lines #5-15, there are only two things to do: if the node is an XML_ELEMENT_NODE, you print its name; then you call yourself again, this time with the children of the current node. There are actually 21 different node types, so it's worth seeing the complete list of choices:

XML_ELEMENT_NODE = 1
XML_ATTRIBUTE_NODE = 2
XML_TEXT_NODE = 3
XML_CDATA_SECTION_NODE = 4
XML_ENTITY_REF_NODE = 5
XML_ENTITY_NODE = 6
XML_PI_NODE = 7
XML_COMMENT_NODE = 8
XML_DOCUMENT_NODE = 9
XML_DOCUMENT_TYPE_NODE = 10
XML_DOCUMENT_FRAG_NODE = 11
XML_NOTATION_NODE = 12
XML_HTML_DOCUMENT_NODE = 13
XML_DTD_NODE = 14
XML_ELEMENT_DECL = 15
XML_ATTRIBUTE_DECL = 16
XML_ENTITY_DECL = 17
XML_NAMESPACE_DECL = 18
XML_XINCLUDE_START = 19
XML_XINCLUDE_END = 20
XML_DOCB_DOCUMENT_NODE = 21

In addition to the "type" value, the xmlNode contains all the critical information about each node in the tree, including navigational pointers (next, prev, parent, children), pointers to the namespace, properties list, node name, and of course the content itself (if any).

Conclusion

In an introductory article such as this, you can only hope to scratch the surface of what a versatile tool such as libxml2 can do for you. Libxml2 supports DTDs, Schemas, XPath, internationalization, and lots more that can make your application XML standards-compliant. As mentioned before, libxml2 comes with many language bindings so you can work in C, C++, C#, Python, Perl, or whatever you need to get the job done. Best of all, it's freely available to integrate into your apps today.

How do I use libxml2 to parse data from XML?

I have looked through the libxml2 code samples, and I am confused about how to put them all together.

What steps are needed to use libxml2 simply to parse or extract data from an XML file?

I would like to retrieve, and possibly store, the values of certain attributes. How is this done?

I believe you first need to build a parse tree. This article may help; see the section titled "How to Parse a Tree with Libxml2".

I found these two resources helpful when I was learning to use libxml2 to build an RSS parser.

A tutorial that uses the DOM tree (with sample code for retrieving an attribute value)

libxml2 provides various examples that show basic usage.

For your stated purpose, tree1.c is probably the most relevant.

tree1.c: Navigates a tree to print element names

Parse a file to a tree, use xmlDocGetRootElement() to get the root element, then walk the document and print all the element names in document order.

Once you have the xmlNode structure for an element, its "properties" member is a linked list of attributes. Each xmlAttr object has a "name" and a "children" member (the attribute's name and value, respectively) and a "next" member that points to the next attribute (or is NULL for the last one).

Here is the full process for extracting XML/HTML data from a file on Windows.

    First, download the precompiled .dll from http://xmlsoft.org/sources/win32/

Also download its dependencies, iconv.dll and zlib1.dll, from the same page.

Extract all the .zip files into one directory, for example D:\demo\

libxml2 SAX parser: extracting text nodes

I want to extract the values of text nodes from XML input. I took the following code from the Internet, because the official libxml documentation has many broken links, the SAX parser documentation among them. Please help me get the value of a text node. In startElementNs, when I try to find my text node, I get NULL. Any help is appreciated.

My XML looks like this:

My code looks like this:

Solution

You need to use the characters callback:

void characters(void *user_data,
const xmlChar *ch,
int len);

The strings are not null-terminated; use ch and len together to determine the string.

Another issue with this callback is that it may be called several times between the start and end of an element. So you cannot blindly assume that a single call gives you the whole string between the tags. You may need a string builder or something similar to collect the pieces.

In the callback you will probably want to copy the characters into
some other buffer so that they can be used from the endElement callback.
To optimize this callback a little, you can set it up so that
it copies the characters only when the parser is in a particular state.
Note that the characters callback may be called more than once between
the startElement and endElement calls.

XML and Gnome libXML2:

This tutorial covers the use of an XML config file and parsing the information using the Gnome libXML2 API.

Related YoLinux Tutorials:

The eXtensible Markup Language (XML) was created to store and define complex, hierarchically structured data for exchange and storage. The XML structure begins its hierarchy at a root node and branches from this document root.

The Document Type Definition (DTD) is optional and defines the data to be presented in an XML document. It is often used to verify the data for completeness and adherence to rules.

XML Schema (XSD) is a newer and more complete data definition with definable types. XSD will be competing with DTD as the format for data definition especially when defining complex relationships and data types.

XML parsers fall into three major categories:

  1. DOM: Import/parse all data into a data structure in memory for query. The data is held as nodes in a data tree which can be traversed. While this is often easier to program than SAX invocations, it uses more memory and runs slower.
  2. SAX: Parse on the fly to look for the data requested. This is event driven: callbacks are invoked as elements are encountered during parsing. The programmer writes the callbacks. A custom class is written for each document. This is considered to be the fastest way to parse a file.
  3. XPath: (XML Path) Search data with a path expression. Very easy to use. Usage is similar to a query, with a syntax that resembles file system paths plus predicates rather than true regular expressions. A node list is returned which matches the XPath expression. It is usually implemented as an extension to DOM.

Number of children:

  • ? Zero or one element permitted.
  • * allows for zero or multiple elements i.e.:
  • + At least one or many elements permitted.
CDATA #REQUIRED
CDATA #IMPLIED
CDATA Character Data
PCDATA Parsed character Data
NMTOKEN No whitespaces.
NMTOKENS One or more name tokens separated by white space
ENUMERATION i.e.
ENTITY
ENTITIES
ID XML name specified:
xml_name2 is required.
IDREF attribute refers to an ID
IDREFS
NOTATION
  • XML names may include _-.
  • When HTML text is included, use &lt;, &gt; and &quot; to represent <, > and " respectively.

Note: The DTD is not required for use with the Gnome LibXml2 API. If using this API to generate XML, the DTD will not be generated.

Prerequisite (RPM) packages: pkgconfig, libxml2-devel, gnome-libs-devel

Compile: gcc -g -Wall `xml2-config --cflags --libs` `gnome-config --cflags --libs gnome gnomeui xml` -o testLibXml2 testLibXml2.c

[Potential Pitfall] : The order of the directory paths referenced matters. Reference the libxml2 include path directories before the gnome directory paths. The following will result in a compilation error:

  • LibXML: xml2-config --cflags --libs
    (Reference this first.)
  • Gtk: pkg-config --cflags --libs gtk+-2.0
  • Gnome: gnome-config --cflags --libs gnome gnomeui xml

Connecting the libxml2 library to Builder2009

This post is still being revised.

It can be downloaded here:

You will also have to download a few additional DLLs:

1. libxml2.dll (I cannot see why it is not included with the library)
2. iconv.dll (it can also be taken from http://www.cyberforum.ru/blogs/131347/blog533.html )
3. zlib1.dll

The DLLs are easy to find on the Internet.

Link libxml2-bcc.lib into the project (it is in OMF format, which is what Builder needs).
For how to link .lib files, see:

In the project options, specify the path to the project folder:

A saved forum page is used as the HTML file.

After the button is clicked, the generated file with an alert should open in the browser.

For reference, a blog post about another C HTML parser:
